**André Platzer Geoff Sutcliffe (Eds.)**

# LNAI 12699

# **Automated Deduction – CADE 28**

**28th International Conference on Automated Deduction Virtual Event, July 12–15, 2021 Proceedings**

# Lecture Notes in Artificial Intelligence 12699

# Subseries of Lecture Notes in Computer Science

Series Editors

- Randy Goebel, University of Alberta, Edmonton, Canada
- Yuzuru Tanaka, Hokkaido University, Sapporo, Japan
- Wolfgang Wahlster, DFKI and Saarland University, Saarbrücken, Germany

Founding Editor

Jörg Siekmann, DFKI and Saarland University, Saarbrücken, Germany

More information about this subseries at http://www.springer.com/series/1244


Editors André Platzer Carnegie Mellon University Pittsburgh, PA, USA

Geoff Sutcliffe University of Miami Coral Gables, FL, USA

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Artificial Intelligence ISBN 978-3-030-79875-8 ISBN 978-3-030-79876-5 (eBook) https://doi.org/10.1007/978-3-030-79876-5

LNCS Sublibrary: SL7 – Artificial Intelligence

© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

# Preface

This volume contains the proceedings of the 28th International Conference on Automated Deduction (CADE-28). CADE is the major forum for the presentation of research in all aspects of automated deduction, including foundations, applications, implementations, and practical experience. CADE-28 was hosted by Carnegie Mellon University, Pittsburgh, USA, 11–16 July 2021, but held online due to the COVID-19 pandemic. CADE-28 emphasized the breadth of topics that are of interest, including applications in and beyond STEM, and the use/contribution of automated deduction in AI.

The Program Committee (PC) accepted 36 papers (29 full papers and 7 system descriptions) out of 76 submissions (59 full papers, 4 short papers, and 13 system descriptions). Each submission was reviewed by at least three Program Committee members or their external reviewers. The criteria for evaluation were originality and significance, technical quality, comparison with related work, quality of presentation, and reproducibility of experiments.

The program of the conference included four invited talks:


The conference hosted several workshops, tutorials, and competitions:


In addition to the best paper awards, three CADE awards were presented at the conference:


Thanks go to the many people without whom the conference would not have been possible: the authors, participants, invited speakers, members of the PC and their subreviewers, conference chairs, local organizers, the workshop/tutorial/competitions chair, the publicity chair, the CADE trustees, the board of the Association for Automated Reasoning, the staff at Springer, and the EasyChair team. CADE-28 gratefully received support from the Automated Reasoning Group at Amazon Web Services, The Journal of Artificial Intelligence, Imandra Inc., and Springer.

July 2021 André Platzer Geoff Sutcliffe

# Organization

#### Program Committee

- Peter Baumgartner, CSIRO, Australia
- Bernhard Beckert, Karlsruhe Institute of Technology, Germany
- Christoph Benzmüller, Freie Universität Berlin, Germany
- Armin Biere, Johannes Kepler University, Austria
- Nikolaj Bjorner, Microsoft Research, USA
- Jasmin Blanchette, Vrije Universiteit Amsterdam, The Netherlands
- Maria Paola Bonacina, Università degli Studi di Verona, Italy
- Agata Ciabattoni, Vienna University of Technology, Austria
- Koen Claessen, Chalmers University of Technology, Sweden
- Hans de Nivelle, Nazarbayev University, Kazakhstan
- Stéphane Demri, CNRS, LMF, France
- Huimin Dong, Sun Yat-sen University, China
- Gilles Dowek, Inria and ENS Paris-Saclay, France
- Mnacho Echenim, Grenoble Alpes University, France
- Pascal Fontaine, Université de Liège, Belgium
- Nathan Fulton, IBM, USA
- Silvio Ghilardi, Università degli Studi di Milano, Italy
- Jürgen Giesl, RWTH Aachen University, Germany
- Rajeev Gore, The Australian National University, Australia
- Nao Hirokawa, Japan Advanced Institute of Science and Technology, Japan
- Moa Johansson, Chalmers University of Technology, Sweden
- Dejan Jovanović, SRI International, USA
- Cezary Kaliszyk, University of Innsbruck, Austria
- Laura Kovacs, Vienna University of Technology, Austria
- Tomer Libal, American University of Paris, France
- Assia Mahboubi, Inria, France
- Cláudia Nalon, University of Brasília, Brazil
- Vivek Nigam, Huawei Technologies, China
- Tobias Nipkow, Technical University of Munich, Germany
- Frank Pfenning, Carnegie Mellon University, USA
- Giles Reger, University of Manchester, UK
- Andrew Reynolds, University of Iowa, USA
- Philipp Rümmer, Uppsala University, Sweden
- Katsuhiko Sano, Hokkaido University, Japan
- Renate A. Schmidt, University of Manchester, UK
- Stephan Schulz, DHBW Stuttgart, Germany
- Viorica Sofronie-Stokkermans, University Koblenz-Landau, Germany
- Martin Suda, Czech Technical University in Prague, Czech Republic
- Tanel Tammet, Tallinn University of Technology, Estonia
- Sophie Tourret, Inria, France
- Christian Urban, King's College London, UK
- Uwe Waldmann, Max Planck Institute for Informatics, Germany
- Yoni Zohar, Stanford University, USA

## Subreviewers

Ruba Alassaf, Johannes Åman Pohjola, Paolo Baldi, Haniel Barbosa, Lee Barnett, Filip Bártek, Ahmed Bhayat, Lionel Blatter, Pierre Boutry, Martin Bromberger, James Brotherston, Claudia Cauli, Anupam Das, Jeremy Dawson, Emanuele De Angelis, Stefan Dollase, Manuel Eberl, Santiago Escobar, Michael Färber, Mathias Fleury, Carsten Fuhs, Thibault Gauthier, Alessandro Gianola, Yuri Gil Dantas, Christoph Haase, Ludovic Henrio, Jera Hensel, Stepan Holub, Ullrich Hustadt, Jan Jakubuv, Peter Jipsen, Daniela Kaufmann, Daisuke Kimura, Michael Kirsten, Patrick Koopmann, Hanna Lachnitt, Florian Lanzinger, Dominique Larchey-Wendling, Jonathan Laurent, Alexander Leitsch, Chencheng Liang, Andrea Mazzullo, Aart Middeldorp, Dale Miller, Julien Narboux, Ulf Norell, Mizuhito Ogawa, Miroslav Olšák, Hitoshi Omori, Jens Otten, Xavier Parent, Dirk Pattinson, Lawrence Paulson, Nicolas Peltier, Michael Rawson, Adrian Rebola Pardo, Giselle Reis, Simon Robillard, Jonas Schiffl, Claudia Schon, Hans-Jörg Schurr, Ying Sheng, Jonni Virtema, Alexander Weigl, Emre Yolcu, Marco Ziener

# Conference Chairs


# Local Organizers


# Workshop/Tutorial/Competitions Chair


# Publicity Chair


# Board of Trustees of CADE Inc.


# Board of the Association for Automated Reasoning

- Christoph Benzmüller (CADE), Freie Universität Berlin, Germany
- Uli Furbach (Vice-president), Universität Koblenz-Landau, Germany
- Jürgen Giesl (CADE), RWTH Aachen University, Germany
- Philipp Rümmer (Secretary), Uppsala University, Sweden
- Sophie Tourret (Newsletter Editor), Inria, France

# Contents

#### Invited Talks


#### Logical Foundations



Clark Barrett, and Cesare Tinelli


#### Implementation and Application




# **Invited Talks**

# **Non-well-founded Deduction for Induction and Coinduction**

Liron Cohen

cliron@cs.bgu.ac.il https://www.cs.bgu.ac.il/~cliron/ Dept. of Computer Science, Ben-Gurion University, Be'er Sheva, Israel

**Abstract.** Induction and coinduction are both used extensively within mathematics and computer science. Algebraic formulations of these principles make the duality between them apparent, but do not account well for the way they are commonly used in deduction. Generally, the formalization of these reasoning methods employs inference rules that express a general *explicit* (co)induction scheme. Non-well-founded proof theory provides an alternative, more robust approach for formalizing *implicit* (co)inductive reasoning. This approach has been extremely successful in recent years in supporting implicit inductive reasoning, but is not as well-developed in the context of coinductive reasoning. This paper reviews the general method of non-well-founded proofs, and puts forward a concrete, natural framework for (co)inductive reasoning, based on (co)closure operators, in which inductive and coinductive reasoning are captured as we intuitively understand and use them. Through this framework we demonstrate the enormous potential of non-well-founded deduction, both in the foundational theoretical exploration of (co)inductive reasoning and in the provision of proof support for (co)inductive reasoning within (semi-)automated proof tools.

## **1 Introduction**

The principle of induction is a key technique in mathematical reasoning that is widely used in computer science for reasoning about recursive data types (such as numbers or lists) and computations. Its dual principle—the principle of coinduction [49,69,70]—is not as widespread, and has only been investigated for a few decades, but still has many applications in computer science, e.g. [42,56,39,52,82,55,57]. It is mainly used for reasoning about coinductive data types (codata), which are data structures containing non-well-founded elements, e.g., infinite streams or trees. One prominent application of coinduction is as a generic formalism for reasoning about state-based dynamical systems, which typically contain some sort of circularity. It is key in proofs of the bisimulation of state-transition systems (i.e., proving that two systems are behaviorally equivalent) and is a primary method for reasoning about concurrent systems [53].

A duality between induction and coinduction is observed when formulating them within an algebraic, or categorical, framework, e.g., [71,64,70,69]. Whereas induction corresponds to a least-fixed-point semantics (or initial algebras), coinduction corresponds to a greatest-fixed-point semantics (or final coalgebras). However, such an algebraic formulation does not account well for the way these principles are commonly used in deduction, where they are usually applied in different ways: induction to prove properties of certain collections, and coinduction to show equivalences between processes and systems.

Since the principle of induction is so well-known, induction methods are relatively well-developed. They are available in most (semi-)automated deduction systems, and tools for the formal verification of software and hardware such as theorem provers. Generally, implementations of the induction method employ one or more inference rules that express a general explicit induction scheme that holds for the elements being reasoned over. That is, to prove that some property, say P, holds for all elements in an inductively defined set, we (i) show that it holds for the initial elements, and (ii) show that P is preserved in the inductive generation of new elements. A side-effect of such implementations is that in applying inductive reasoning, the induction invariant must be provided explicitly. While advanced provers offer powerful facilities for producing and manipulating inductive goals, this still poses a major automation challenge. This formalization of the induction principle uses the classical notion of formal proofs invoked in standard theorem provers. There, proofs are well-founded trees, starting at the goal and reaching axioms while proceeding by applications of inference rules.

A more robust and natural alternative formalization of inductive reasoning is implicit induction, which avoids the need for explicitly specifying induction invariants. This form of reasoning is enabled by extending the standard notion of well-founded, finite proof trees into non-well-founded proof trees, where the presence of cycles can be exploited instead of cluttering the proof with explicit inductive invariants. For example, to prove P(x) using implicit induction, one repeatedly decomposes the goal into subgoals that are either provable in the standard way (via well-founded subtrees) or reducible back to P(x). This alternative has deep historic roots (originating in Fermat's infinite-descent method) and recently has seen a flourishing of its proof theory via cyclic proof systems.

Non-well-founded proof theory and its cyclic fragment (comprising only finite, regular proofs) have been extremely successful in recent years in supporting implicit inductive reasoning. For one, the non-well-founded approach has been used to obtain (optimal) cut-free completeness results for highly expressive logics, such as the μ-calculus [3,35,34,37] and Kleene algebra [32,33], providing further evidence of its utility for automation. Other works focus on the structural proof theory of non-well-founded systems, which promotes additional insights into standard proof-theoretical questions by separating local steps of deductive inference from global well-foundedness arguments. In particular, syntactic cut elimination for non-well-founded systems has been studied extensively in the linear logic setting [41,7]. Much work has been devoted to the formal study of explicit versus implicit forms of induction in various logical settings, including the μ-calculus [72,75,7,62], systems for arithmetic [74,31], and first-order logics with inductive definitions [19,14]. The latter offers a system parameterized by a set of inductive predicates with associated rules, rather than a single rule for induction as with the others. The cyclic machinery has also been used to effectively search for proofs of inductive properties and automatically verify properties of inductive programs, especially in the context of separation logic [78,68,16,17,18].

Unlike induction, the coinduction principle has not been so fully and naturally incorporated into major theorem provers, but it has gained importance and attention in recent years. As noted by Basold, Komendantskaya, and Li: "it may be surprising that automated proof search for coinductive predicates in first-order logic does not have a coherent and comprehensive theory, even after three decades..." [8]. Automated provers, to the best of our knowledge, currently do not offer any support for coinduction, and while coinductive data types have been implemented in interactive theorem provers (a.k.a. proof assistants) such as Coq [11,47,83], Nuprl [30], Isabelle [13,81,12,38], Agda [1], Lean [4], and Dafny [54], the treatment of these forms of data is often partial. These formalizations, as well as other formal frameworks that support the combination of induction and coinduction, e.g., [80,61,6,46], generally rely on making (co)invariants explicit within proofs. But just as inductive reasoning is naturally captured via proof cycles, cyclic systems seem to be particularly well-suited for also encompassing the implicit notion of coinduction. Nonetheless, while non-well-founded proof theory has been very successful in supporting inductive reasoning, this proof method has not been equally incorporated and explored in the context of coinductive reasoning. Some notable cyclic systems that do support coinduction in various settings include [67,58,72,36,2]. Another related framework is that of Coq's parameterized coinduction [47,83], which offers a different, but highly related, implicit nature of proofs (based on patterns within parameters, rather than within proof sequents).

This paper reviews the general method of non-well-founded proof theory, focusing on its use in capturing both implicit inductive and coinductive reasoning. Throughout the paper we focus on one very natural and simple logical framework to demonstrate the benefits of the approach: that of transitive (co)closure logic. This logic offers a succinct and intuitive dual treatment of induction and coinduction, while still supporting their common practices in deduction, making it well suited for prototyping. More specifically, it has the benefits of (1) conciseness: no need for a separate language or interpretation for definitions, nor for fully general least/greatest-fixed-point operators; (2) intuitiveness: the concept of transitive closure is basic, and the dual closure is equally simple to grasp, resulting in a simpler metatheory; (3) illumination: similarities, dualities, and differences between induction and coinduction are clearly demonstrated; and (4) naturality: local reasoning is rudimentary, and the global structure of proofs directly reflects higher-level reasoning. The framework presented is based on ongoing work by Reuben Rowe and the author, some of which can be found in [26,29,28,23]. We conclude the paper by briefly discussing two major open research questions in the field of non-well-founded proof theory: namely, the need for a user-friendly implementation of the method in modern proof assistants, in order to make it applicable and to facilitate advancements in automated proof search and program verification, and the task of determining the precise relationship between systems for cyclic reasoning and standard systems for explicit reasoning.

# **2 The Principles of Induction and Coinduction**

A duality between the induction principle and the coinduction principle is clearly observed when formulating them within an algebraic, or categorical, framework. This section reviews such a general algebraic formalization (Section 2.1), and then presents transitive (co)closure logic, which will serve as our running example throughout this paper as it provides simple, yet very intuitive, inductive and coinductive notions (Section 2.2).

## **2.1 Algebraic Formalization of Induction and Coinduction**

Both the induction principle and the coinduction principle are usually defined algebraically via the concept of fixed points, where the definitions vary in different domains such as order theory, set theory or category theory. We opt here for a set-theoretical representation for the sake of simplicity, but more general representations, e.g., in a categorical setting, are also well-known [71].

Let Ψ : ℘(D) → ℘(D) be a monotone operator on sets for some fixed domain D (where ℘(D) denotes the power set of D). Since (℘(D), ⊆) is a complete lattice, by the Knaster–Tarski theorem both the least fixed point and the greatest fixed point of Ψ exist. The least fixed point (μ) is given by the intersection of all of Ψ's prefixed points, that is, those sets A satisfying Ψ(A) ⊆ A; dually, the greatest fixed point (ν) is given by the union of all of Ψ's postfixed points, that is, those sets A satisfying A ⊆ Ψ(A). These definitions naturally yield corresponding induction and coinduction principles.
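To make the lattice-theoretic definitions concrete, here is a minimal Python sketch on a finite domain: μ is computed by iterating a monotone operator upward from the empty set, ν by iterating downward from the full domain. The operator `psi` and the domain are illustrative choices of ours, not from the paper.

```python
# Knaster-Tarski on the finite complete lattice (P(D), ⊆): for a monotone
# operator, iterating from the bottom reaches the least fixed point, and
# iterating from the top reaches the greatest fixed point.

def lfp(psi):
    """mu(psi): iterate psi starting from the empty set until stable."""
    a = frozenset()
    while psi(a) != a:
        a = psi(a)
    return a

def gfp(psi, top):
    """nu(psi): iterate psi starting from the full domain until stable."""
    a = top
    while psi(a) != a:
        a = psi(a)
    return a

D = frozenset(range(5))

# An illustrative monotone operator: keep the elements below 4 and always
# (re)add 0. Its prefixed and postfixed points differ, so mu and nu differ.
def psi(a):
    return frozenset({0}) | frozenset(n for n in a if n < 4)

assert lfp(psi) == frozenset({0})              # least fixed point
assert gfp(psi, D) == frozenset({0, 1, 2, 3})  # greatest fixed point
```

Since μ(psi) is strictly contained in ν(psi) here, the sketch also illustrates that the least fixed point is always contained in the greatest one.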


The induction principle states that μ(Ψ) is contained in every Ψ-closed set, where a set A is called Ψ-closed if Ψ(A) ⊆ A (which means that μ(Ψ) = ⋂{A | Ψ(A) ⊆ A}). The coinduction principle dually states that ν(Ψ) contains every Ψ-consistent set, where a set A is called Ψ-consistent if A ⊆ Ψ(A) (which means that ν(Ψ) = ⋃{A | A ⊆ Ψ(A)}).

The intuition behind an inductively defined set is that of a "bottom-up" construction. That is, one starts with a set of initial elements and then applies the constructor operators finitely many times. One concrete example of an inductively defined set is that of finite lists, which can be constructed starting from the empty list and one constructor operator that adds an element to the head of the list. The finiteness restriction stems from the fact that the inductively defined set is the smallest set that can be constructed using the operators. Using the induction principle, one can show that all elements of an inductively defined set satisfy a certain property, by showing that the property is preserved by each constructor operator. A coinductively defined set is also constructed by starting with a set of initial elements and applying the constructor operators, but possibly infinitely many times. One example, which arises from the same initial element and constructors as the inductive set of lists, is that of possibly infinite lists, i.e., the set that also contains infinite streams. The fact that we can apply the operators infinitely many times is due to the coinductively defined set being the largest set that can (potentially) be constructed using the operators. Using the coinduction principle, one can show that an element is in a coinductively defined set.
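The finite/infinite contrast can be sketched in Python, where generators play the role of possibly infinite lists; `cons`, `nil`, and `nats_from` are hypothetical names chosen for this illustration, not part of the paper's formalism.

```python
from itertools import islice

def nil():
    """The empty list: a generator that yields nothing."""
    return iter(())

def cons(head, tail):
    """One constructor step: a thunk yielding head, then the tail's elements."""
    def gen():
        yield head
        yield from tail()
    return gen

# An inductively built element: finitely many cons steps ending in nil.
finite = cons(1, cons(2, cons(3, nil)))

# A coinductively built element: a stream produced by corecursion,
# i.e. the constructor is (lazily) applied infinitely many times.
def nats_from(n):
    def gen():
        yield n
        yield from nats_from(n + 1)()
    return gen

assert list(finite()) == [1, 2, 3]
assert list(islice(nats_from(0)(), 5)) == [0, 1, 2, 3, 4]
```

Only a finite prefix of the stream can ever be observed (via `islice`), matching the intuition that coinductive data is characterized by its observations rather than by a finite construction.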

#### **2.2 Transitive (Co)closure Operators**

Throughout the paper we will use two instances of fixed points that provide a minimal framework which captures applicable forms of inductive and coinductive reasoning in an intuitive manner, and is more amenable for automation than the full theory of fixed points. This section introduces these fixed points and discusses the logical framework obtained by adding them to first-order logic.

**Definition 1 ((Post-)Composition Operator).** Given a binary relation X, Ψ<sub>X</sub> is the operator on binary relations that post-composes its input with X, that is, Ψ<sub>X</sub>(R) = X ∪ (X ◦ R) = {(a, c) | (a, c) ∈ X ∨ ∃b . (a, b) ∈ X ∧ (b, c) ∈ R}.

Because unions and compositions are monotone operators over a complete lattice, so are composition operators, and therefore both μ(Ψ<sub>X</sub>) and ν(Ψ<sub>X</sub>) exist. A pair of elements (a, b) is in μ(Ψ<sub>X</sub>) when b is in every X-closed set that can be reached by some X-steps from a, which is equivalent to saying that there is a finite (non-empty) chain of X-steps from a to b. A pair of elements (a, b) is in ν(Ψ<sub>X</sub>) when there exists a set A that contains a such that the set A \ {b} is X-consistent, which is equivalent to saying that either there is a finite (non-empty) chain of X-steps from a to b, or there is an infinite chain of X-steps starting from a.
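On a finite domain both fixed points of the post-composition operator can be computed directly. The following sketch (with names of our choosing) shows μ(Ψ<sub>X</sub>) coming out as the transitive closure of X, while ν(Ψ<sub>X</sub>) additionally relates a to every element whenever an infinite X-chain starts at a.

```python
def compose(x, r):
    """Relational composition X ∘ R = {(a, c) | ∃b. (a, b) ∈ X and (b, c) ∈ R}."""
    return frozenset((a, c) for (a, b) in x for (b2, c) in r if b == b2)

def psi(x):
    """The post-composition operator Psi_X(R) = X ∪ (X ∘ R)."""
    return lambda r: frozenset(x) | compose(x, r)

def fix(f, start):
    """Iterate f from start until a fixed point is reached (finite lattice)."""
    a = start
    while f(a) != a:
        a = f(a)
    return a

D = range(4)
X = frozenset({(0, 1), (1, 2), (2, 2)})  # the self-loop at 2 gives infinite chains
full = frozenset((a, b) for a in D for b in D)

mu = fix(psi(X), frozenset())  # least fixed point: the transitive closure of X
nu = fix(psi(X), full)         # greatest fixed point: the transitive co-closure

assert mu == frozenset({(0, 1), (0, 2), (1, 2), (2, 2)})
# Every element with an outgoing infinite X-chain is nu-related to everything:
assert nu == frozenset((a, c) for a in range(3) for c in range(4))
```

With the self-loop at 2 removed, no infinite X-chains remain and the two fixed points coincide, which is exactly the semantic gap between the two operators described above.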

The μ(Ψ<sub>X</sub>) operator is in fact the standard transitive closure operator. Extending first-order logic (FOL) with this transitive closure operator results in the well-known transitive closure logic (a.k.a. ancestral logic), a generic, minimal logic for expressing finitary<sup>1</sup> inductive structures [48,73,5,24,25,23]. Transitive closure (TC) logic was recently extended with a dual operator, called transitive co-closure, that corresponds to ν(Ψ<sub>X</sub>) [27]. The definition below presents the syntax and semantics of the extended logic, called transitive (co)closure logic, or TcC logic.

**Definition 2 (**TcC **Logic).** For σ a first-order signature, let s, t and P range over terms and predicate symbols over σ (respectively), and let M be a structure for σ, and ν a valuation in M.

**Syntax.** The language $\mathcal{L}_{TcC}$ (over σ) is given by the following grammar:

$$\varphi, \psi ::= s = t \mid P(t_1, \dots, t_n) \mid \neg \varphi \mid \varphi \land \psi \mid \varphi \lor \psi \mid \varphi \to \psi \mid \forall x \,.\, \varphi \mid \exists x \,.\, \varphi \mid (TC_{x,y}\, \varphi)(s, t) \mid (TC_{x,y}^{\mathsf{op}}\, \varphi)(s, t)$$

<sup>1</sup> See [40] for a formal definition of "finitary" inductive definitions.

where the variables x, y in the formulas $(TC_{x,y}\, \varphi)(s, t)$ and $(TC_{x,y}^{\mathsf{op}}\, \varphi)(s, t)$ are distinct and are bound in the subformula ϕ.

**Semantics.** The satisfaction relation M,ν |= ϕ extends the standard satisfaction relation of classical first-order logic with the following clauses:

$$\begin{aligned}
M, \nu \models (TC_{x,y}\,\varphi)(s,t) \;\Leftrightarrow\; &\exists (d_i)_{i \le n} \,.\; d_1 = \nu(s) \wedge d_n = \nu(t) \\
&\wedge\, \forall i < n \,.\; M, \nu[x := d_i, y := d_{i+1}] \models \varphi \\
M, \nu \models (TC_{x,y}^{\mathsf{op}}\,\varphi)(s,t) \;\Leftrightarrow\; &\exists (d_i)_{i > 0} \,.\; d_1 = \nu(s) \\
&\wedge\, \forall i > 0 \,.\; d_i = \nu(t) \vee M, \nu[x := d_i, y := d_{i+1}] \models \varphi
\end{aligned}$$

where $\nu[x_1 := d_1, \dots, x_n := d_n]$ denotes the valuation that maps $x_i$ to $d_i$ and behaves as ν otherwise; $\varphi[t_1/x_1, \dots, t_n/x_n]$ denotes simultaneous substitution; and $(d_i)_{i \le n}$ and $(d_i)_{i > 0}$ denote, respectively, non-empty finite and (countably) infinite sequences of elements from the domain.

Intuitively, the formula $(TC_{x,y}\, \varphi)(s, t)$ asserts that there is a (possibly empty) finite ϕ-path from s to t, while the formula $(TC_{x,y}^{\mathsf{op}}\, \varphi)(s, t)$ asserts that either there is a (possibly empty) finite ϕ-path from s to t, or there is an infinite ϕ-path starting at s. For simplicity of presentation we take here the reflexive forms of the closure operators, which yields the following correspondence.<sup>2</sup>

**Proposition 1.** Let $[\![\varphi]\!]^{M,\nu}_{x,y} := \{(a, b) \mid M, \nu[x := a, y := b] \models \varphi\}$.

$$(i)\ \ M, \nu \models (TC_{x,y}\,\varphi)(s,t) \;\Leftrightarrow\; \nu(s) = \nu(t) \,\vee\, (\nu(s), \nu(t)) \in \mu\bigl(\Psi_{[\![\varphi]\!]^{M,\nu}_{x,y}}\bigr)$$

$$(ii)\ M, \nu \models (TC^{\mathsf{op}}_{x,y}\,\varphi)(s,t) \;\Leftrightarrow\; \nu(s) = \nu(t) \,\vee\, (\nu(s), \nu(t)) \in \nu\bigl(\Psi_{[\![\varphi]\!]^{M,\nu}_{x,y}}\bigr)$$

Note that, unlike the situation in standard fixed-point logics, the two closure operators are not inter-definable. The TC operator is definable in arithmetic (i.e., in Peano Arithmetic, PA), but the TC<sup>op</sup> operator is not.

Thus, TcC logic is subsumed by fixed-point logics, such as the first-order μ-calculus [64], but the concept of the transitive (co)closure is intuitively simpler than that of general fixed-point operators, and it does not require any syntactic restrictions to ensure monotonicity. In fact, due to its complexity and generality, the investigation of the full first-order μ-calculus tends to focus only on variants and fragments, and is mainly concentrated on the logical and model-theoretic aspects, lacking a comprehensive proof theory.<sup>3</sup> Another reason for focusing on these (co)closure operators is that they allow for the embedding of many forms of inductive and coinductive reasoning within one concise logical framework. Thus, while other extensions of FOL with inductive definitions are a priori parametrized by a set of inductive definitions [59,60,79,19], bespoke induction principles do not need to be added to TcC logic; instead, applicable (co)induction schemes are available within a single, unified language. This conciseness allows the logic to be formally captured using one fixed set of inference rules, and thus makes it particularly amenable to automation. Moreover, in TcC logic, the same signature is shared for both inductive and coinductive data, making certain aspects of the relationship between the two principles more apparent.

<sup>2</sup> The definition of the post-composition operator can be reformulated to incorporate the reflexive case; however, we opt to keep the more standard definition.

<sup>3</sup> Proof theory has been developed for the propositional modal μ-calculus fragment [51], and recently also for matching μ-logic [20,21,22], which generalizes the μ-calculus.

Defining infinite structures via the coclosure operators in TcC logic leads to a symmetric foundation for functional languages where inductive and coinductive data types can be naturally mixed. For example, using the standard list constructors (the constant nil and the (infix) binary function symbol '::') and their axiomatization, the collections of finite lists, possibly infinite lists, and infinite lists (i.e., streams) are straightforwardly definable as follows.

$$\begin{aligned} \text{List}(\sigma) &:= (TC_{x,y}\, \exists a.\, x = a :: y)(\sigma, \text{nil}) \\ \text{List}^{\infty}(\sigma) &:= (TC_{x,y}^{\mathsf{op}}\, \exists a.\, x = a :: y)(\sigma, \text{nil}) \\ \text{Stream}(\sigma) &:= (TC_{x,y}^{\mathsf{op}}\, \exists a.\, x = a :: y \land y \neq \text{nil})(\sigma, \text{nil}) \land \sigma \neq \text{nil} \end{aligned}$$

TcC logic also naturally captures properties of, and functions on, streams [29].

#### **3 Non-well-founded Deduction for Induction**

This section presents the general method of non-well-founded proof theory (Section 3.1), and then provides a concrete example of a non-well-founded proof system for inductive reasoning in the setting of the transitive closure (Section 3.2), where the implicit form of inductive reasoning is then compared against the explicit one. Note that this section first presents the proof theory only for TC logic, which is the inductive fragment of TcC logic, i.e., the one based only on the transitive closure operator.

#### **3.1 Non-well-founded Proof Theory**

The method of non-well-founded proofs provides an alternative approach to explicit inductive reasoning by exploiting the fact that there are no infinite descending chains of elements of well-ordered sets. Clearly, not all non-well-founded proof trees constitute a valid proof, i.e., a proof of the validity of the conclusion in the root. Proof trees that simply loop on the conclusion, or that repeatedly use the substitution or permutation rules to obtain cycles, are examples of non-well-founded proof trees that one would not like to consider as valid. Thus, a non-well-founded proof tree is allowed to be infinite, but to be considered a valid proof, it has to obey an additional requirement that prevents such unsound deductions. Hence, non-well-founded proofs are subject to the restriction that every infinite path in the proof admits some infinite descent. Intuitively, the descent is witnessed by tracing syntactic elements, terms or formulas, for which we can give a correspondence with elements of a well-founded set. In this respect, non-well-founded proof theory enables a separation between local steps of deductive inference and global well-foundedness arguments, which are encoded in traces of terms or formulas through possibly infinite derivations.

Below we present proof systems in the style of the sequent calculus. Sequents are expressions of the form Γ ⇒ Δ, for finite sets of formulas Γ and Δ. We write Γ, ϕ as a shorthand for Γ ∪ {ϕ}, and fv(Γ) for the set of free variables of the formulas in Γ. A sequent Γ ⇒ Δ is valid if and only if the formula $\bigwedge_{\varphi \in \Gamma} \varphi \to \bigvee_{\psi \in \Delta} \psi$ is.
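For the propositional fragment, this validity condition can be checked by brute force over truth assignments. The encoding and function names below are ours, chosen only to illustrate the definition.

```python
from itertools import product

def atoms(f):
    """Collect the propositional atoms of a formula (nested tuples; atoms are strings)."""
    if isinstance(f, str):
        return {f}
    return set().union(*(atoms(g) for g in f[1:]))

def ev(f, v):
    """Evaluate a formula under a valuation v (dict from atom to bool)."""
    if isinstance(f, str):
        return v[f]
    op, *args = f
    if op == 'not': return not ev(args[0], v)
    if op == 'and': return all(ev(g, v) for g in args)
    if op == 'or':  return any(ev(g, v) for g in args)
    if op == 'imp': return (not ev(args[0], v)) or ev(args[1], v)
    raise ValueError(op)

def sequent_valid(gamma, delta):
    """Γ ⇒ Δ is valid iff every valuation satisfying all of Γ satisfies some ψ ∈ Δ."""
    vs = sorted(set().union(*(atoms(f) for f in (*gamma, *delta))) or {'p'})
    return all(
        any(ev(psi, dict(zip(vs, bits))) for psi in delta)
        for bits in product([False, True], repeat=len(vs))
        if all(ev(phi, dict(zip(vs, bits))) for phi in gamma)
    )

# p, p → q ⇒ q (modus ponens) is valid; p ⇒ q is not.
assert sequent_valid(['p', ('imp', 'p', 'q')], ['q'])
assert not sequent_valid(['p'], ['q'])
```

The empty antecedent and succedent behave as expected under this reading: Γ = ∅ makes the hypothesis trivially true, and Δ = ∅ makes the conclusion trivially false.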

Let S be a collection of inference rules. First, we define the notion of a non-well-founded proof tree, a pre-proof, based on S.

**Definition 3 (Pre-proofs).** A pre-proof in S is a possibly infinite derivation tree formed using the inference rules of S. A path in a pre-proof is a possibly infinite sequence of sequents $s_0, s_1, \dots (, s_n)$ such that $s_0$ is the root sequent of the proof, and $s_{i+1}$ is a premise of $s_i$ in the derivation tree for each $i < n$.

As mentioned, not every pre-proof is a proof: only those in which every infinite branch exhibits some notion of infinite descent, which allows one to formalize inductive arguments. To make this concrete, one picks syntactic elements, which can be formulas or terms, to be tracked through a pre-proof. We call such elements traced elements. The intuition behind picking the traced elements is that, given a pre-proof, we can trace these elements through the infinite branches and map them into some well-founded set. This is what underpins the soundness of the non-well-founded method, as explained below. Given certain traced elements, we inductively define a notion of trace pairs, which corresponds to the appearances of such traced elements within applications of the inference rules throughout the proof. That is, for traced elements τ, τ′ and a rule application with conclusion s and premise s′ such that τ appears in s and τ′ appears in s′, (τ, τ′) is said to be a trace pair for (s, s′); at least one such case has to be identified as a progressing trace pair. Progression intuitively stands for the cases in which the elements of the trace pair are mapped to strictly decreasing elements of the well-founded set. We provide a concrete example of traced elements and a trace pair definition in the transitive closure setting in Section 3.2.

**Definition 4 (Traces).** A trace is a (possibly infinite) sequence of traced elements. We say that a trace $\tau_1, \tau_2, \ldots (, \tau_n)$ follows a path $s_1, s_2, \ldots (, s_m)$ in a pre-proof P if, for some k ≥ 0, each consecutive pair of traced elements $(\tau_i, \tau_{i+1})$ is a trace pair for $(s_{i+k}, s_{i+k+1})$. If $(\tau_i, \tau_{i+1})$ is a progressing pair, then we say that the trace progresses at i, and we say that the trace is infinitely progressing if it progresses at infinitely many points.

Proofs, then, are pre-proofs which satisfy a global trace condition.

**Definition 5 (Infinite Proofs).** A proof is a pre-proof in which every infinite path is followed by some infinitely progressing trace.

We denote by S<sup>∞</sup> the non-well-founded proof system based on the rules in S. The general soundness argument for such infinite systems follows from a combination of standard local soundness of the inference rules in S together

$$\frac{}{\Gamma \Rightarrow \Delta,\, (TC_{x,y}\,\varphi)(s,s)}\;(TC_{\mathrm{ref}}) \qquad \frac{\Gamma \Rightarrow \Delta,\, \varphi\{\tfrac{s}{x},\tfrac{r}{y}\} \quad \Gamma \Rightarrow \Delta,\, (TC_{x,y}\,\varphi)(r,t)}{\Gamma \Rightarrow \Delta,\, (TC_{x,y}\,\varphi)(s,t)}\;(TC_{R})$$

$$\frac{\Gamma, s = t \Rightarrow \Delta \qquad \Gamma,\, \varphi\{\tfrac{s}{x},\tfrac{z}{y}\},\, \boxed{(TC_{x,y}\,\varphi)(z,t)} \Rightarrow \Delta}{\Gamma,\, (TC_{x,y}\,\varphi)(s,t) \Rightarrow \Delta}\;(TC_{L}^{\mathrm{im}})$$

$$\frac{\Gamma,\, \psi,\, \varphi \Rightarrow \Delta,\, \psi\{\tfrac{y}{x}\}}{\Gamma,\, \psi\{\tfrac{s}{x}\},\, (TC_{x,y}\,\varphi)(s,t) \Rightarrow \Delta,\, \psi\{\tfrac{t}{x}\}}\;(TC_{L}^{\mathrm{ex}})$$

where in $(TC_{L}^{\mathrm{im}})$, $z \notin fv(\Gamma, \Delta, (TC_{x,y}\,\varphi)(s,t))$, and in $(TC_{L}^{\mathrm{ex}})$, $x \notin fv(\Gamma, \Delta)$ and $y \notin fv(\Gamma, \Delta, \psi)$.

Fig. 1: Proof rules for the TC operator

with a global soundness argument via an infinite-descent-style construction, due to the presence of infinitely progressing traces for each infinite path in a proof. One assumes for contradiction that the conclusion of the proof is invalid, which, by the local soundness of the rules, entails the existence of an infinite sequence of counter-models along an infinite branch. Then, one gives a mapping of these models into a well-founded set (D, <), whose value decreases while following the sequence of counter-models, and strictly decreases when passing progression points. But then, by the global trace condition, there exists an infinitely descending chain in D, which contradicts the well-foundedness of (D, <).

While a full infinitary proof system is clearly not effective, effectiveness can be obtained by restricting consideration to the cyclic proofs, i.e., those that are finitely representable. These are the regular infinite proof trees, which contain only finitely many distinct subtrees. Intuitively, the cycles in the proofs capture the looping nature of inductive arguments and, thereby, the cyclic framework provides the basis for an effective system for automated inductive reasoning. A possible way of formalizing such proof graphs is as standard proof trees containing open nodes, called buds, to each of which is assigned a syntactically equal internal node of the proof, called a companion (see, e.g., [19, Sec.7] for a formal definition).

**Definition 6 (Cyclic Proofs).** The cyclic proof system S<sup>ω</sup> is the subsystem of S<sup>∞</sup> comprising all and only the finite and regular infinite proofs (i.e., those proofs that can be represented as finite, possibly cyclic, graphs).
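The bud–companion representation of such finite, possibly cyclic, graphs can be sketched as a small data structure. The encoding below is a hypothetical illustration (the names `Node` and `check_buds` are ours, not from the literature): a cyclic proof is a finite tree whose open leaves each point back to a syntactically equal internal node.

```python
# Hypothetical, minimal encoding of a cyclic proof object; the names and
# representation are ours, not from the literature. A bud is an open leaf
# whose `companion` points to a syntactically equal internal node.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    sequent: str                          # the sequent at this node, as text
    rule: Optional[str] = None            # rule applied here (None at a bud)
    premises: list["Node"] = field(default_factory=list)
    companion: Optional["Node"] = None    # set only on buds

def check_buds(root: Node) -> bool:
    """Every bud must be assigned a syntactically equal companion."""
    ok, stack = True, [root]
    while stack:
        n = stack.pop()
        if n.companion is not None:
            ok = ok and n.companion.sequent == n.sequent
        stack.extend(n.premises)
    return ok
```

A real checker would additionally have to verify the global trace condition of Definition 5 on the cycles induced by the bud–companion assignment; this sketch only captures the shape of the object.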

#### **3.2 Explicit vs. Implicit Induction in Transitive Closure Logic**

Since we focus on the formal treatment of induction in this section, we present here the proof systems for TC logic, i.e., the logic comprising only the TC operator extension. Both proof systems presented are extensions of LK=, the sequent calculus for classical first-order logic with equality [44].<sup>4</sup>

Figure 1 presents proof rules for the TC operator. Rules (TC<sub>ref</sub>) and (TC<sub>R</sub>) assert the reflexivity and the transitivity of the TC operator, respectively. Rule

<sup>4</sup> Here LK= includes a substitution rule, which was not part of the original system.

(TC<sub>L</sub><sup>ex</sup>) can be intuitively read as follows: if the extension of ψ is ϕ-closed, then it is also closed under the reflexive transitive closure of ϕ. Rule (TC<sub>L</sub><sup>im</sup>) is, in a sense, a case-unfolding argument, stating that to prove something about the reflexive transitive closure of ϕ, one must prove it for the base case (i.e., s = t) and also for an arbitrary decomposition step (i.e., where the ϕ-path is decomposed into its first step and the remaining path).

The explicit (well-founded) proof system S<sub>TC</sub> is based on rules (TC<sub>ref</sub>), (TC<sub>R</sub>), and (TC<sub>L</sub><sup>ex</sup>). The implicit (non-well-founded) proof system S<sup>∞</sup><sub>TC</sub> is based on rules (TC<sub>ref</sub>), (TC<sub>R</sub>), and (TC<sub>L</sub><sup>im</sup>), and its cyclic subsystem is denoted by S<sup>ω</sup><sub>TC</sub>. In S<sup>∞</sup><sub>TC</sub>, the traced elements are TC formulas on the left-hand side of the sequents, and the points of progression are marked with a box in Figure 1. The soundness of the S<sup>∞</sup><sub>TC</sub> system is then underpinned by mapping each model of a TC formula of the form (TC<sub>x,y</sub> ϕ)(s, t) to the minimal length of the ϕ-path between s and t.

Rules (TC<sub>L</sub><sup>ex</sup>) and (TC<sub>L</sub><sup>im</sup>) both offer a unified treatment of inductive reasoning, in the sense that bespoke induction principles need not be added to the systems. A big advantage of the implicit system is that it can alleviate the major challenge in automating inductive reasoning: finding the induction invariant a priori. Indeed, a major difference between these two induction rules is the presence of the induction invariant. In (TC<sub>L</sub><sup>ex</sup>), unlike in (TC<sub>L</sub><sup>im</sup>), the induction invariant, namely ψ, appears explicitly. In S<sup>∞</sup><sub>TC</sub>, by contrast, the induction invariant, which is often stronger than the goal one is attempting to prove, can (usually) be inferred via the cycles in the proof.

Since TC logic subsumes arithmetic, by Gödel's incompleteness theorem the system S<sub>TC</sub>, while sound, is incomplete with respect to the standard semantics.<sup>5</sup> Nonetheless, the full non-well-founded proof system S<sup>∞</sup><sub>TC</sub> is sound and (cut-free) complete for TC logic [28,26]. Furthermore, the cyclic subsystem S<sup>ω</sup><sub>TC</sub> subsumes the explicit system S<sub>TC</sub>.

# **4 Adding Coinductive Reasoning**

This section extends the non-well-founded proof theory of TC logic from Section 3.2 to support the transitive coclosure operator, and thus the full TcC logic (Section 4.1). We then provide an illustrative example of the use of the resulting framework, demonstrating its potential for automated proof search (Section 4.2).

## **4.1 Implicit Coinduction in Transitive (Co)closure Logic**

The implicit (non-well-founded) proof system for TcC logic, denoted S<sup>∞</sup><sub>TcC</sub>, is an extension of the system S<sup>∞</sup><sub>TC</sub>, obtained by adding the proof rules for the TC<sup>op</sup> operator presented in Figure 2. Again, rules (TC<sup>op</sup><sub>ref</sub>) and (TC<sup>op</sup><sub>R</sub>) state the reflexivity and transitivity of the TC<sup>op</sup> operator, respectively, and rule (TC<sup>op</sup><sub>L</sub>) is a case-unfolding argument. However, unlike the case of the TC operator, in which rule (TC<sub>L</sub><sup>im</sup>) can be replaced by a rule that decomposes the path from the

<sup>5</sup> S<sub>TC</sub> is sound and complete with respect to a generalized form of Henkin semantics [23].

$$\frac{}{\Gamma \Rightarrow \Delta,\, (TC^{\mathrm{op}}_{x,y}\,\varphi)(s,s)}\;(TC^{\mathrm{op}}_{\mathrm{ref}}) \qquad \frac{\Gamma \Rightarrow \Delta,\, \varphi\{\tfrac{s}{x},\tfrac{r}{y}\} \quad \Gamma \Rightarrow \Delta,\, \boxed{(TC^{\mathrm{op}}_{x,y}\,\varphi)(r,t)}}{\Gamma \Rightarrow \Delta,\, (TC^{\mathrm{op}}_{x,y}\,\varphi)(s,t)}\;(TC^{\mathrm{op}}_{R})$$

$$\frac{\Gamma, s = t \Rightarrow \Delta \qquad \Gamma,\, \varphi\{\tfrac{s}{x},\tfrac{z}{y}\},\, (TC^{\mathrm{op}}_{x,y}\,\varphi)(z,t) \Rightarrow \Delta}{\Gamma,\, (TC^{\mathrm{op}}_{x,y}\,\varphi)(s,t) \Rightarrow \Delta}\;(TC^{\mathrm{op}}_{L})$$

where in $(TC^{\mathrm{op}}_{L})$, $z \notin fv(\Gamma, \Delta, (TC^{\mathrm{op}}_{x,y}\,\varphi)(s,t))$.

Fig. 2: Proof rules for the TC op operator

end, in rule (TC<sup>op</sup><sub>L</sub>) it is critical that the decomposition starts at the first step (as there is no end point). Apart from the additional inference rules, S<sup>∞</sup><sub>TcC</sub> also extends the traced elements to include TC<sup>op</sup> formulas, which are traced on the right-hand side of the sequents, and the points of progression are marked with a box in Figure 2.

Interestingly, the two closure operators are captured proof-theoretically using inference rules with the exact same structure. The difference arises from the way the decomposition of the corresponding formulas is traced in a proof derivation: for induction, TC formulas are traced on the left-hand sides of sequents; for coinduction, TC<sup>op</sup> formulas are traced on the right-hand sides of sequents. Thus, traces of TC formulas show that certain infinite paths cannot exist (induction is well-founded), while traces of TC<sup>op</sup> formulas show that other infinite paths must exist (coinduction is productive). This form of the rules for the (co)closure operators is extremely useful with respect to automation: the rules are locally uniform, enabling the same treatment of induction and coinduction, but also globally dual, ensuring that the underlying system handles them appropriately (at the limit). Also, just as in the case of induction, the coinduction invariant is not explicitly mentioned in the inference rules.

The full non-well-founded system S<sup>∞</sup><sub>TcC</sub> is sound and (cut-free) complete with respect to the semantics of TcC logic [27]. It has been shown to be powerful enough to capture non-trivial examples of mixed inductive and coinductive reasoning (such as the transitivity of the substream relation), and to provide a smooth integration of induction and coinduction while also highlighting their similarities. To exemplify the naturality of the system, Figure 3 demonstrates a proof that the transitive closure is contained within the transitive co-closure. The proof has a single cycle (and thus a single infinite path), but, following this path, there is both a trace, consisting of the TC formulas highlighted in blue, and a co-trace, consisting of the TC<sup>op</sup> formulas highlighted in pink (the progression points are marked with boxes). Thus, the proof can be seen both as a proof by induction and as a proof by coinduction.

#### **4.2 Applications in Automated Proof Search**

The cyclic reasoning method seems to have enormous potential for the automation of (co)inductive reasoning, which has not been fully realized. Most notably, as

Fig. 3: Proof that the TC op operator subsumes the TC operator

mentioned, cyclic systems can facilitate the discovery of a (co)induction invariant, which is a primary challenge for mechanized (co)inductive reasoning.<sup>6</sup> Thus, in implicit systems, the (co)inductive arguments and hypotheses may be encoded in the cycles of a proof, in the sense that when developing the proof, one can start with the goal and incrementally adjust the invariant as many times as necessary. Roughly speaking, one can perform lazy unfolding of the (co)closure operators to a point in which a cycle can be obtained, taking advantage of non-local information retrieved in other branches of the proof.
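As a toy illustration of this lazy, cycle-closing style of search (our own sketch, not the machinery of any actual prover), the routine below unfolds goals and closes a branch when it either reaches an axiom or rediscovers a syntactically equal ancestor goal. A real system would also have to check the global trace condition before accepting such a cycle.

```python
# Illustrative sketch only (not the calculus of this paper): depth-limited
# proof search that lazily unfolds goals and closes a branch either by an
# axiom or by forming a cycle with a syntactically equal ancestor goal.
# NB: accepting every cycle is unsound in general; a real prover must also
# check the global trace condition.

def search(goal, unfold, is_axiom, ancestors=(), depth=8):
    """Return True if every branch closes by an axiom or by a cycle."""
    if is_axiom(goal):
        return True
    if goal in ancestors:            # bud matches an ancestor: close a cycle
        return True
    if depth == 0:
        return False
    return all(search(g, unfold, is_axiom, ancestors + (goal,), depth - 1)
               for g in unfold(goal))

# Toy instance: the goal ("even", n) unfolds to ("even", n - 2), and
# ("even", 0) is an axiom. A symbolic argument "k" makes the unfolding
# reproduce the goal, so that branch closes by a cycle.
def unfold(goal):
    _, n = goal
    return [("even", "k")] if n == "k" else [("even", n - 2)]

def is_axiom(goal):
    return goal == ("even", 0)
```

Here `search(("even", 4), unfold, is_axiom)` closes by unfolding down to the axiom, while the symbolic goal `("even", "k")` closes by forming a cycle with itself.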

The implications of these phenomena for proof search can be examined using proof-theoretic machinery to analyze and manipulate the structures of cyclic proofs. For example, when verifying properties of mutually defined relations, the associated explicit (co)induction principles are often extremely complex. In the cyclic framework, such complex explicit schemes generally correspond to overlapping cycles. Exploring such connections between hard problems that arise from explicit invariants and the corresponding structure of cyclic proofs can facilitate automated proof search. The cyclic framework offers yet another benefit for verification, in that it enables the separation of two critical properties of a program, namely liveness (termination) and safety (correctness). Thus, while proving a safety property (validity of a formula), one can extract liveness arguments via infinite descent.

#### **4.2.1 Program Equivalence in the TcC Framework**

The use of the (co)closure operators in the TcC framework seems to be particularly well-suited for formal verification, as these operators can be used to simultaneously express the operational semantics of programs and the structure of the (co)data manipulated by them. Use of the same constructors for both features of the program constitutes an improvement over current formal frameworks, which

<sup>6</sup> Some verification approaches can discover inductive invariants automatically [43,45], or direct their construction based on the property being verified [63,50], but they do not currently support coinductive reasoning.

```
rest := fix rest(f). λn. if n > 0 then (output n; rest f (n − 1)) else f 0
f    := fix f(n). let v = (output n; input()) ∗ 2 in (if v = 0 then rest f else f) (v + n)
g    := fix g(m). output (2 ∗ m); let v = input() in if v = 0 then rest g (2 ∗ m) else g (v + m)
```

```
RES := (TC_{(u1,u2),(v1,v2)} (u1 > 0 ∧ v1 = u1 − 1 ∧ u2 = u1 :: v2) ∨ (u1 = v1 = 0 ∧ u2 = v2))(n, s, 0, s′)

ψf := ∃i, w. x2 = i :: w ∧
      [(i ∗ 2 ≠ 0 ∧ y1 = i ∗ 2 + x1 ∧ w = x1 :: y2) ∨ (i = y1 = 0 ∧ RES(x1, w, x1 :: y2))]

ψg := ∃i, w. x2 = i :: w ∧
      [(i ≠ 0 ∧ y1 = i + x1 ∧ w = (2 ∗ x1) :: y2) ∨ (i = y1 = 0 ∧ RES(2 ∗ x1, w, (2 ∗ x1) :: y2))]

SPEC : (TC^op_{(x1,x2),(y1,y2)} ψf)(2 ∗ m, s, ⊥, ⊥) ⇐⇒ (TC^op_{(x1,x2),(y1,y2)} ψg)(m, s, ⊥, ⊥)
```
Fig. 4: The recursive programs and their formalization in TcC

usually employ qualitatively different formalisms to describe the operational semantics of programs and the associated data.<sup>7</sup> For instance, although many formalisms employ separation logic to describe the data structures manipulated by programs (e.g., the Cyclist prover [18]), they also encode the relationships between the program's memory and its operational behavior via bespoke symbolic-execution inference rules [10,65].

To demonstrate the capabilities and benefits of the TcC framework for verification and automated proof search, we present the following example, posed in [47, Sec. 3]. The example consists of proving that the two recursive programs given in Figure 4 (weakly) simulate one another. Both programs continually read the next input, compute the double of the sum of all inputs seen so far, and output the current sum. On input zero, both programs count down to zero and start over. The goal is to formally verify that g(m) is equivalent to f(2m). However, as noted in [47], a formal proof of this claim via the standard Tarskian coinduction principle is extremely laborious. This is mainly because one must come up with an appropriate "simulation relation" that contains all the intermediate execution steps of f and g, appropriately matched, which must be fully defined before we can even start the proof.

The (co)closure operators offer a formalization of the problem which is very natural and amenable to automation, formalizing the programs by encoding all (infinite) traces of f and g as streams of input/output events. Hence, the simulation amounts to the fact that each such stream for f can be simulated by g, and vice versa. The bottom part of Figure 4 shows the formalization of the specification in TcC logic, where the encoding of each program is a natural simplification that can easily (and automatically) be obtained from either structural operational semantics or Floyd–Hoare-style axiomatic semantics. We use ⊥ as a designated unreachable element (i.e., an element not related to any other element). The fact

<sup>7</sup> Notable exceptions include [66,76,20,21,22], which take a similar approach but invoke second-order elements.

Fig. 5: Structure of the proof of one direction of SPEC

that the (co)closure operators can be applied to complex formulas that include, for example, quantifiers, disjunctions and nesting of the (co)closure operators, enables a concise, natural presentation without resorting to complex case analysis. This offers a significant a priori simplification of the formula we provide to the proof system (and, in turn, to a prover), even before starting the proof-search procedure.

The cyclic proof system, in turn, enables a natural treatment of the coinductive reasoning involved in the proof, in a way that is particularly amenable to automation. Figure 5 outlines the structure of the proof of one direction of the equivalence defined in SPEC. For conciseness, the subscripts x1, x2, y1, y2 are omitted from all TC<sup>op</sup> formulas and we use (TC<sup>op</sup> ϕ)<sub>⊥</sub>(u, v) as a shorthand for (TC<sup>op</sup> ϕ)(u, v, ⊥, ⊥). The proof is compact and the local reasoning is standard: namely, the unfolding of the TC<sup>op</sup> operator. The proof begins with a single unfolding of the TC<sup>op</sup> formula on the left and then proceeds with its unfolding on the right. The key observation is that the instantiation of the unfolding on the right (i.e., the choice of the term r in Rule (TC<sup>op</sup><sub>R</sub>)) can be automatically inferred from the terms of the left unfolding, by unification. Thus, when applying Rule (TC<sup>op</sup><sub>R</sub>), one does not have to guess the intermediate term (in this case, z1/2, z2); instead, the term can be automatically inferred from the equalities in the subproof of the single-step implication, as illustrated by the green question marks in Figure 5.
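The unification step mentioned above can be made concrete with textbook first-order syntactic unification. The sketch below is generic illustrative code (the term encoding and all names are ours; the occurs check is omitted): a metavariable standing for the unknown intermediate term picks up its value when matched against the instance produced by the left unfolding.

```python
# Generic first-order syntactic unification (textbook sketch, ours; the
# occurs check is omitted). Terms are tuples ("f", arg1, ...); variables
# are strings beginning with "?".

def walk(t, subst):
    """Chase variable bindings until a non-bound term is reached."""
    while isinstance(t, str) and t.startswith("?") and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst=None):
    """Return a substitution unifying a and b, or None if none exists."""
    subst = dict(subst or {})
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if isinstance(a, str) and a.startswith("?"):
        subst[a] = b
        return subst
    if isinstance(b, str) and b.startswith("?"):
        subst[b] = a
        return subst
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

# The unknown intermediate term of the right unfolding, written ?r, is
# inferred by unifying the rule's premise pattern against the instance
# obtained from the left unfolding (a made-up example term).
binding = unify(("step", "n", "?r"), ("step", "n", ("half", "z1")))
# binding == {"?r": ("half", "z1")}
```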

Finally, to formally establish the correctness of our simplified formalization, one needs to prove that, for example, the abstract RES(n, s, s′) is indeed equivalent to the concrete restart behavior (the program rest) of f and of g. This can be formalized and proved in a straightforward manner, as the proof has a dual structure and contains a TC cycle. This further demonstrates the compositionality of the TcC framework, as such an inductive subproof is completely independent of the general, outer coinductive TC<sup>op</sup> cycle.

#### **5 Perspectives and Open Questions**

As mentioned, the approach of non-well-founded proof theory holds great potential for improving the state-of-the-art in formal support for automated inductive and coinductive reasoning. But the investigation of cyclic proof systems is far from complete, and much work is still required to provide a full picture. This section concludes by describing two key research questions, one concerning the applicability of the framework and the other concerning the fundamental theoretical study of the framework.

#### **5.1 Implementing Non-well-founded Machinery**

Current theorem provers offer little or no support for implicit reasoning. Thus, major verification efforts are missing out on its great potential for lighter, more legible, and more automated proofs. The main implementation of cyclic reasoning can be found in the cyclic theorem prover Cyclist [18], a fully automated prover for inductive reasoning based on the cyclic framework developed in [15,16,19]. Cyclist has been very successful in formal verification in the setting of separation logic. Cyclic inductive reasoning has also been partially implemented in the Coq proof assistant through the development of external libraries and functional schemas [77]. Neither implementation supports coinductive reasoning, however.

To guarantee soundness, and to decide whether a cyclic pre-proof satisfies the global trace condition, most cyclic proof systems feature a mechanism that uses a construction involving an inclusion between Büchi automata (see, for example, [15,74]). This mechanism can be (and has been) applied successfully in automated frameworks, but it lacks the transparency and flexibility that one needs in interactive theorem proving. For example, encoding proof validity into Büchi automata makes it difficult to understand why a cyclic proof is invalid in order to attempt to fix it. Therefore, to fully integrate cyclic reasoning into modern interactive theorem provers in a useful manner, an intrinsic criterion for soundness must be developed, one that does not require the use of automata but instead operates directly on the proof tree.
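One candidate for such an intrinsic criterion, given here only as an illustrative over-approximation of the global trace condition, is purely graph-theoretic: if deleting all progressing edges from the finite proof graph leaves an acyclic graph, then every cycle, and hence every infinite path, crosses a progress point infinitely often. The check is sufficient but not complete; the Python sketch below (our own, with made-up names) makes the idea concrete.

```python
# Sufficient (not complete) trace-condition check, our own sketch:
# delete the progressing edges and verify that the remaining proof graph
# is acyclic, so that every cycle must pass through a progress point.

def satisfies_simple_trace_condition(nodes, edges, progressing):
    """nodes: set of node ids; edges: set of (u, v); progressing: subset of edges."""
    remaining = {u: [] for u in nodes}
    for (u, v) in edges:
        if (u, v) not in progressing:
            remaining[u].append(v)

    # DFS cycle detection on the non-progressing subgraph.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {u: WHITE for u in nodes}

    def has_cycle(u):
        color[u] = GREY
        for v in remaining[u]:
            if color[v] == GREY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False

    return not any(color[u] == WHITE and has_cycle(u) for u in list(nodes))
```

For example, a three-node cycle with one progressing edge passes the check, while the same cycle with no progressing edge fails it; a complete criterion would, in addition, have to account for traces that switch between cycles.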

#### **5.2 Relative Power of Explicit and Implicit Reasoning**

In general, explicit schemes for induction and coinduction are subsumed by their implicit counterparts. The converse, however, does not hold in general. In [19], it was conjectured that the explicit and cyclic systems for FOL with inductive definitions are equivalent. They were later shown to be equivalent in the presence of arithmetic [19], where the embedding of the cyclic system into the explicit one relied on an encoding of the cycles in the proof. However, it was also shown, via a concrete counter-example, that in the general case the cyclic system is strictly stronger than the explicit one [9]. A careful examination of this counter-example reveals that it only refutes a weak form of the conjecture, according to which the inductive definitions available in both systems are the

same. That is, if the explicit system is extended with other inductive predicates, the counter-example for the equivalence no longer holds. Therefore, the weaker formulation of the question, namely, whether for any proof in the cyclic system there is a proof in the explicit system for some set of inductive predicates, has not yet been resolved. In particular, in the TcC setting, while the equivalence under arithmetic also holds, the fact that there is no a priori restriction on the (co)inductive predicates one is allowed to use makes the construction of a similar counter-example in the general case much more difficult. In fact, the explicit and cyclic systems may even coincide for TcC logic.

Even in cases where explicit (co)induction can capture implicit (co)induction (or a fragment of it), there are still open questions regarding the manner in which this capturing preserves certain patterns. A key question is whether the capturing can be done while preserving important properties such as proof modularity. The current discourse contains only partial answers to such questions [75,77,68], which should be investigated thoroughly and systematically. The uniformity provided by the closure operators in the TcC setting can facilitate a study of this subtle relationship between implicit and explicit (co)inductive reasoning.

**Acknowledgements.** As mentioned in the introduction, the TcC framework is based on a wonderful ongoing collaboration with Reuben Rowe. The author is also extremely grateful to Andrei Popescu and Shachar Itzhaky for their contributions to the framework.

# **References**


*26*th *International Conference on Automated Reasoning with Analytic Tableaux and Related Methods, TABLEAUX 2017*, pages 247–260, Cham, 2017.


*ference on Foundations of Software Science and Computation Structures, FOSSACS 2002*, pages 357–371, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.



# **Towards the Automatic Mathematician**

Markus N. Rabe and Christian Szegedy

Google Research Mountain View, California, USA {mrabe,szegedy}@google.com

**Abstract.** In recent years, deep learning has found successful applications in mathematical reasoning. Today, we can predict fine-grained proof steps, relevant premises, and even useful conjectures using neural networks. This extended abstract summarizes recent developments of machine learning in mathematical reasoning and the vision of the N2Formal group at Google Research to create an automatic mathematician. The second part discusses the key challenges on the road ahead.

**Keywords:** Automated reasoning · machine learning · mathematical reasoning · theorem proving · natural language understanding.

# **1 Introduction**

The combination of machine learning and mathematical reasoning goes back at least to the 2000s, when Stephan Schulz pioneered ideas to use machine learning to control the search process [44], and Josef Urban used machine learning to select relevant axioms [46,47]. With the advent of deep learning, interest in the area surged, as deep learning promises to enable the automatic discovery of new knowledge from data while requiring minimal engineering. This opened up a flurry of new possibilities for theorem proving as well.

One of the most challenging and impactful tasks in automated theorem proving is *premise selection*, that is, finding relevant premises in a large body of available theorems/axioms. Many classical reasoning systems do not scale well to thousands of potentially relevant facts; pioneering results by Urban et al. [47] therefore proposed fast machine learning techniques using manually engineered features. With the inroads of deep learning, however, it has become clear that large quality improvements are possible using deep learning techniques. DeepMath [24] demonstrated that premise selection could be tackled with deep learning directly (i.e., without feature engineering), applying neural networks to the text of the premise and that of the (negated) conjecture.

In DeepMath, both premise and conjecture are embedded into a vector space by a (potentially expensive) neural network, and then a second (preferably cheap) neural network compares the embedding of the current state to that of each available premise to judge whether the premise is useful. Loos et al. [36] demonstrated, for the first time, that the same approach as DeepMath yields substantial improvements as an internal guidance method within a first-order automated theorem prover.
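The division of labor between the two networks can be caricatured in a few lines of plain Python. Everything below (hashing features, a fixed random linear layer, dot-product scoring, and all names) is an illustrative stand-in rather than DeepMath's actual architecture; the point it demonstrates is that premise embeddings are computed once and cached, while only the cheap comparison runs per query.

```python
# Schematic two-tower premise selection in plain Python (no ML framework).
# Hashing features, the fixed random weight matrix W, and the dot-product
# score are illustrative stand-ins, not DeepMath's actual architecture.

import math
import random

DIM = 64
_rng = random.Random(0)
W = [[_rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(DIM)]

def embed(statement: str) -> list[float]:
    """'Expensive' tower: bag-of-tokens hashing features pushed through one
    fixed linear layer with a tanh nonlinearity."""
    feats = [0.0] * DIM
    for tok in statement.split():
        feats[hash(tok) % DIM] += 1.0
    return [math.tanh(sum(W[i][j] * feats[j] for j in range(DIM)))
            for i in range(DIM)]

def score(goal_vec, premise_vec) -> float:
    """'Cheap' tower: compare two embeddings by a dot product."""
    return sum(g * p for g, p in zip(goal_vec, premise_vec))

def rank_premises(goal: str, premises: list[str]) -> list[str]:
    g = embed(goal)                          # goal embedded once per query
    cache = {p: embed(p) for p in premises}  # premise embeddings are reusable
    return sorted(premises, key=lambda p: -score(g, cache[p]))
```

In a real system, the expensive tower is a trained network and the premise embeddings are precomputed for the whole theorem library, so that scoring thousands of premises per proof state stays cheap.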

*Neural Theorem Provers.* Emboldened by these early works and by breakthroughs in deep learning, several groups extended interactive theorem provers<sup>1</sup> for use in deep learning research, including Gamepad [23], HOList [5], Coq-Gym [54], GPT-f [39], and recently TacticZero [51]. A typical tactic application predicted by these systems looks as follows (here in HOL Light syntax):

$$\underbrace{\texttt{REWRITE\_TAC}}_{\textit{tactic name}}\;\underbrace{[\,\texttt{PREMISE1};\;\texttt{PREMISE2}\,]}_{\textit{list of premises}}$$

This specific tactic expects the given premises to be equalities, with which it attempts to rewrite subexpressions in the current proof goal. The hard part about predicting good tactics is to select the right list of premises from all the previously proven theorems. Some tactics also include free-form expressions, which can be a challenge as well.

In contrast to approaches using lightweight machine learning techniques (e.g., [13,25,26,38,31]), neural theorem provers aim to replicate the human approach to proving theorems in ITPs, searching only through a relatively small number (e.g., hundreds) of proof steps that are very promising. To get high-quality proof steps, increasingly large neural networks (currently up to 774M parameters [39]) are trained on human proofs, or with reinforcement learning.

Already, neural theorem provers can prove a significant portion (up to 70% [4]) of test theorems, and some have found proofs that are shorter and more elegant than the proofs that human mathematicians have formalized in these systems. For example, for the theorem `CLOSURE_CONVEX_INTER_AFFINE`, proven with over 40 tactic calls in HOL Light [20], HOList/DeepHOL has found a proof with just two tactic calls:

```
let CLOSURE_CONVEX_INTER_AFFINE = prove
 (`!s t:real^N->bool.
      convex s /\ affine t /\ ~(relative_interior s INTER t = {})
      ==> closure(s INTER t) = closure(s) INTER t`,
  SIMP_TAC [INTER_COMM; AFFINE_IMP_CONVEX;
            CLOSURE_INTER_CONVEX; RELATIVE_INTERIOR_AFFINE] THEN
  ASM_MESON_TAC [RELATIVE_INTERIOR_EQ_CLOSURE; INTER_COMM;
                 RELATIVE_INTERIOR_UNIV; IS_AFFINE_HULL]);;
```

<sup>1</sup> The focus has been on interactive theorem provers, as they are general enough to capture most of mathematics in theory, and several large-scale formalization efforts of the last decades have demonstrated that involved theories can be formalized in practice [28,14,19]. Also, ITPs offer relatively short proofs compared to other automated reasoning tools, which allows us to use stronger neural networks for the same computational budget.

Similarly, Polu et al. reported several cases where they found proofs with their neural theorem prover GPT-f that were shorter and more elegant than those found by humans [39].

*Neural Solvers.* Closely related to neural theorem provers are methods that, instead of predicting proof steps, directly predict the solution to mathematical problems. A first impressive example was given by Selsam et al., who showed that graph neural networks can predict satisfying assignments of small Boolean formulas [45]. Lample and Charton demonstrated that higher-level representations, such as the integral of a formula, can also be predicted directly using a Transformer [29]. They exploited the fact that for some mathematical operations, such as taking the integral, the inverse operation (taking the derivative) is much easier. Hence, they can train on predicting generated formulas from their derivatives without needing a tool that can compute the integral in the first place. Recently, Hahn et al. demonstrated that classical verification problems, such as LTL satisfiability, can also be solved directly with Transformers, in some cases beating existing tuned algorithms on their own datasets [18].
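The data-generation trick can be sketched in a few lines: differentiate randomly generated expressions and train on the reversed pairs, so that "integration" data is produced without ever integrating. The tuple representation and tiny grammar below are our own illustration:

```python
import random

# Expressions: "x", an int constant, ("+", a, b), or ("*", a, b).
def deriv(e):
    # Symbolic derivative with respect to x (sum and product rules).
    if e == "x":
        return 1
    if isinstance(e, int):
        return 0
    op, a, b = e
    if op == "+":
        return ("+", deriv(a), deriv(b))
    return ("+", ("*", deriv(a), b), ("*", a, deriv(b)))  # product rule

def random_expr(depth, rng):
    if depth == 0:
        return rng.choice(["x", rng.randrange(1, 5)])
    op = rng.choice(["+", "*"])
    return (op, random_expr(depth - 1, rng), random_expr(depth - 1, rng))

# Training pairs: the *input* is the derivative, the *target* the original
# expression, so the model learns the harder inverse direction.
rng = random.Random(0)
pairs = [(deriv(e), e) for e in (random_expr(2, rng) for _ in range(3))]
print(len(pairs))  # prints 3
```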

# **2 Towards the Automatic Mathematician**

We are convinced that the success of neural theorem provers and neural solvers is only the beginning of a larger development in which deep learning will revolutionize automated reasoning, and have set out to build an *automatic mathematician*. Ideally, we could simply talk to an automatic mathematician like a colleague, and it would be able to contribute to mathematical research, for example by publishing papers without human support.

An automatic mathematician would thus go far beyond theorem proving, as it would have to formulate and explore its own theories and conjectures, and be able to communicate in natural language. Yet, we believe that neural theorem provers are an important instrument of our plan, as they allow us to evaluate (generated) conjectures, which grounds the learning process in mathematically correct reasoning steps. And because neural theorem provers build on existing interactive theorem provers, they already come with a nucleus of formalized mathematics that we believe might be necessary to bootstrap the understanding of mathematics. In the following, we review some of the main challenges on the path towards an automatic mathematician and first approaches to address them.

#### **2.1 Neural Network Architectures**

Naturally, we need neural network architectures that can "understand" formulas, that is, make useful predictions based on them. The main question in the design of such networks appears to be *whether* and, if so, *how* to exploit the tree structure of formulas.

*Exploiting the Structure of Formulas.* It is tempting to believe that the embeddings of formulas should represent their semantics. Hence, many authors have suggested processing formulas with tree-structured recurrent neural networks (TreeRNNs), which compute embeddings of expressions from the embeddings of their subexpressions, as this resembles the bottom-up way in which we define their semantics (e.g., [11,1,23,54]). That intuition, however, may be misleading. In our experiments, bottom-up TreeRNNs have performed significantly worse than top-down architectures (followed by a max-pool aggregation) [37]. This suggests that, to make good predictions based on formulas, it is important to consider subformulas in their context, which bottom-up TreeRNNs cannot do easily.
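A bottom-up TreeRNN in miniature; the toy "cells" below are stand-ins for learned networks, and the point is structural: each subexpression is embedded before anything about its surrounding context is known.

```python
def tree_embed(node, embed_leaf, combine):
    # Bottom-up TreeRNN: a node's embedding is computed from the
    # embeddings of its children, mirroring the bottom-up definition
    # of formula semantics (no access to the surrounding context).
    if isinstance(node, str):
        return embed_leaf(node)
    op, *children = node
    return combine(op, [tree_embed(c, embed_leaf, combine) for c in children])

# Toy 2-dimensional "cells": sum the children and shift by an operator code.
OPS = {"and": 1.0, "or": 2.0, "not": 3.0}
embed_leaf = lambda s: [float(len(s)), 0.0]
combine = lambda op, embs: [sum(e[0] for e in embs) + OPS[op],
                            sum(e[1] for e in embs) + 1.0]

# In ("and", p, ("not", q)), the embedding of q is fixed before its
# context ("not", and the enclosing "and") is ever seen.
print(tree_embed(("and", "p", ("not", "q")), embed_leaf, combine))  # prints [6.0, 2.0]
```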

*Sequence Models.* The alternative to representing the formula structure in the neural architecture is to interpret formulas simply as sequences of characters or symbols and apply sequence models. Early works using sequence modeling relied on convolutional networks (simple convolutional networks [24] and WaveNets [36,5]), which compared favorably to gated recurrent architectures like LSTMs/GRUs. With the recent rise of the Transformer architecture [48], sequence models have caught up with those that exploit the formula structure and have yielded excellent performance in various settings [29,41,52,39,18].
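The sequence view needs nothing more than a tokenizer turning a formula string into a stream of symbols. A minimal sketch (the splitting rules are our own illustration, not the scheme of any cited system):

```python
def tokenize(formula):
    # Split a formula string into symbol tokens; multi-character
    # identifiers stay whole, every other visible character is a token.
    tokens, word = [], ""
    for ch in formula:
        if ch.isalnum() or ch == "_":
            word += ch
        else:
            if word:
                tokens.append(word)
                word = ""
            if not ch.isspace():
                tokens.append(ch)
    if word:
        tokens.append(word)
    return tokens

toks = tokenize("closure(s INTER t) = closure(s) INTER t")
print(toks)
```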

Sequence models come with two major advantages. First, it is straightforward not only to read formulas but also to generate them, which is surprisingly challenging with TreeRNNs or graph neural networks. This allows us to directly predict proof steps as strings [39,52] and to tackle a wider range of mathematical reasoning tasks, such as predicting the integral of a formula [29], satisfying traces for formulas in linear temporal logic [18], or even more creative tasks, such as predicting missing assumptions and conjectures [41].<sup>2</sup> Second, Transformer models have shown surprising flexibility and promise a uniform way to process not only formulas, but also natural language and even images [10]. This could prove crucial for processing natural language mathematics, which frequently mixes formulas, text, and diagrams; any model processing papers would need to understand how these relate to each other. Transformers certainly set a high bar for the flexibility, generality, and performance of future neural architectures.

*Large Models.* Scaling up language models to larger and larger numbers of parameters has steadily improved their results [27,22]. Also when we use language models for mathematics, we have observed that larger models tend to improve the quality of predictions [39,41]. GPT-3 has shown that certain abilities, such as basic arithmetic, appear to materialize only in models with at least a certain number of parameters [6]. If this turns out to be true for other abilities, it raises the question of how large models have to be to exhibit human-level mathematical reasoning abilities.

<sup>2</sup> Yet, there are still cases where hard-coding some formula structure into transformer architectures can improve the results, as shown, for example, by Wu et al. [21,35,18], which suggests that transformers are not the end of the story for formula understanding.

There is also the question of how exactly to scale up models. The mere number of parameters may not be as important as how we use them. More efficient alternatives to simply scaling up the Transformer architecture might also help make large models accessible to more researchers (e.g., [32]).

#### **2.2 Training Methodology**

Neural networks have shown the ability to learn even advanced reasoning tasks via supervised learning, given the right training data. However, for many interesting tasks we do not have such data, and hence the question arises of how to train neural networks for tasks for which we have only limited data or none at all.

*Reinforcement Learning.* Reinforcement learning can be seen as a way to reduce the amount of human-written proof data needed to learn a strong theorem prover. By training on the proofs generated by the system itself, we can improve its abilities to some extent, and perhaps the strongest neural theorem provers often use some form of reinforcement learning (e.g., up to 70% of the proofs in HOL Light [4]). But, for an open-ended training methodology, we need a system that can effectively explore new and interesting theories without getting lost in irrelevant branches of mathematics. Partial progress has been made in training systems without access to human-written proofs [4,51], and in generating conjectures to train on in a reinforcement learning setting [12], but the problem is wide open.
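The basic loop of training on the system's own proofs can be sketched as follows; the toy `attempt` and `retrain` functions below are illustrative stand-ins for a neural prover and its training step:

```python
def reinforcement_loop(theorems, attempt, retrain, rounds=3):
    # Expert-iteration style loop: try theorems with the current policy,
    # keep the proofs that succeed, and retrain the policy on them.
    policy, proofs = {}, {}
    for _ in range(rounds):
        for thm in theorems:
            proof = attempt(policy, thm)
            if proof is not None:
                proofs[thm] = proof       # successful proofs become data
        policy = retrain(policy, proofs)  # policy improves between rounds
    return proofs

# Toy instance: a theorem becomes provable once the policy "knows"
# the strictly easier theorem just below it.
def attempt(policy, thm):
    return f"proof_{thm}" if thm == 0 or (thm - 1) in policy else None

def retrain(policy, proofs):
    return dict(proofs)  # "training" = memorizing solved theorems

print(sorted(reinforcement_loop(range(5), attempt, retrain).keys()))  # prints [0, 1, 2]
```

Each round unlocks one more theorem, illustrating how self-generated data extends the prover's reach without any human proofs.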

*Pretraining.* In natural language understanding, it is already common practice to pretrain transformers on a large body of text before fine-tuning them on the final task, especially when only limited data is available for that task. Even though the pretraining data is only loosely related to the final tasks, transformers benefit a lot from pretraining, as it conveys general world knowledge and useful inductive biases [9]. Polu et al. have shown that the same can be observed when pretraining transformers on natural language texts from arXiv [39].

*Self-supervised Training.* The GPT models for natural language have shown that self-supervised language modeling (i.e., only "pre"training without training on any particular task) alone can equip transformers with surprising abilities [42,6]. Mathematical reasoning abilities, including type inference, predicting missing assumptions and conjecturing, can be learned in a very similar way by training transformers to predict missing subexpressions (skip-tree training) [41].
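Skip-tree training needs no labels: mask out a random subexpression and ask the model to reconstruct it. A sketch of the pair generation, with formulas as nested tuples (the representation and `<MASK>` token are our own illustration):

```python
import random

def subtrees(e, path=()):
    # Enumerate all subexpression positions of a nested-tuple formula.
    yield path, e
    if isinstance(e, tuple):
        for i, child in enumerate(e[1:], start=1):
            yield from subtrees(child, path + (i,))

def mask_at(e, path):
    # Replace the subexpression at `path` with a <MASK> placeholder.
    if not path:
        return "<MASK>"
    i = path[0]
    return e[:i] + (mask_at(e[i], path[1:]),) + e[i + 1:]

def skip_tree_pair(e, rng):
    # A self-supervised training pair: input = formula with a random
    # subexpression masked out, target = the masked subexpression.
    path, target = rng.choice(list(subtrees(e)))
    return mask_at(e, path), target

formula = ("->", ("and", "p", "q"), "p")
masked, target = skip_tree_pair(formula, random.Random(1))
print(masked, target)
```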

Lample et al. devised several clever approaches to train transformers when data is not directly available. In unsupervised translation training, transformers successfully learn to translate between different natural languages starting only from monolingual corpora, without any corresponding pairs of sentences [30]. This approach was even generalized to translating between programming languages without corresponding pairs of programs [43]. The application of these unsupervised translation ideas to mathematics is tempting, but in our experience their straightforward application does not lead to good results. Wang et al. [49] also report mixed results.

*Learning to Retrieve Relevant Information.* If we apply standard language models to mathematics, e.g., to predict the next proof step, we expect them to store all the information necessary to make good predictions in their parameters. As large transformer models have shown (see, e.g., GPT [42,6]), this approach actually works quite well for natural language question answering, and it has also been surprisingly successful on mathematical benchmarks [41,39,53]. However, there may be a limit to this approach in cases where we expect detailed, consistent, and up-to-date predictions. Guu et al. [17] introduced REALM, a hybrid of a transformer and a retrieval model, which learns to retrieve Wikipedia articles relevant to a given question and to extract useful information from them. REALM is trained in a self-supervised fashion: it retrieves multiple articles and tries to use each of them individually to make predictions. The article that leads to the best prediction is deemed the most relevant and is used to train the retrieval query for future training iterations. This approach has been extended in follow-up work [33,2,34,3] and appears promising also for retrieving the relevant context for mathematical reasoning, such as definitions, possible premises, and even related proofs.
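The REALM-style training signal can be sketched in a few lines: try each candidate document, see which one lets the reader answer correctly, and reward that document's retrieval score. The documents, reader, and scoring below are toy stand-ins for the learned retriever and language model:

```python
def retrieve_and_train(query, docs, predict, answer, query_weights):
    # REALM-style step: attempt the prediction with each candidate
    # document individually; the document that makes the prediction
    # succeed is deemed relevant and its retrieval weight is reinforced.
    scored = []
    for name, text in docs.items():
        helped = predict(query, text) == answer
        scored.append((helped, name))
    best = max(scored)[1]
    query_weights[best] = query_weights.get(best, 0) + 1  # train the retriever
    return best

docs = {
    "def_group": "a group is a set with an associative operation ...",
    "def_prime": "a prime has exactly two divisors",
}
# Toy "reader": answers "2" only when the retrieved text mentions divisors.
predict = lambda q, text: "2" if "divisors" in text else "?"
w = {}
best = retrieve_and_train("how many divisors has a prime?", docs, predict, "2", w)
print(best, w)  # prints def_prime {'def_prime': 1}
```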

#### **2.3 Instant Utilization of New Premises**

Theorem proving differs from other reinforcement learning settings in a key way: whenever we reach one of the goals, i.e., prove a theorem, we can use that theorem as a premise in future proof attempts. Any learning method applied in a reinforcement learning setting for theorem proving thus needs to adapt to this growing action space, and ideally requires no retraining at all when a new theorem becomes available.

Premise selection approaches that are built on retrieval, such as DeepMath [24,36] and HOList [5,37], offer this ability: when a new theorem is proven, we can add it to the list of premises that can be retrieved, and future retrieval queries can return its statement. This appears to work well, even when the provers are applied to a new set of theorems, as demonstrated by the DeepHOL prover when it was applied to the unseen Flyspeck theorem database [5]. We can even exploit this kind of generalization for exploration and bootstrap neural theorem provers without access to human proofs as training data [4].
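The mechanism can be sketched as a growing retrieval index: a newly proven theorem is embedded and added, and it is immediately available to the next query, with no retraining. The bag-of-symbols "embedding" below is a stand-in for a learned statement encoder:

```python
def embed(statement):
    # Stand-in for a learned statement embedding: a bag of symbols.
    return frozenset(statement.split())

class PremiseIndex:
    # Retrieval-based premise selection: newly proven theorems are added
    # to the index and become usable immediately, without retraining.
    def __init__(self):
        self.entries = {}

    def add(self, name, statement):
        self.entries[name] = embed(statement)

    def query(self, goal, k=1):
        g = embed(goal)
        ranked = sorted(self.entries,
                        key=lambda n: -len(self.entries[n] & g))
        return ranked[:k]

idx = PremiseIndex()
idx.add("ADD_COMM", "a + b = b + a")
# A theorem proven a moment ago is retrievable for the very next goal:
idx.add("MUL_COMM", "a * b = b * a")
print(idx.query("x * y = y * x"))  # prints ['MUL_COMM']
```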

A new challenge arises from the use of language models for theorem proving. Theorem provers using transformers currently have no dedicated retrieval module, and instead predict the statements or names of premises as part of the tactic string (cf. [39]). In our experience this does not provide the required generalization to unseen premises without retraining. (Though there are experiments that suggest that it might be possible [8].) Future approaches will have to find a way to combine the strong reasoning skills and generative abilities of Transformer models with the ability to use new premises without retraining.

#### **2.4 Natural Language**

We believe that, perhaps counterintuitively, natural language plays a central role in automated reasoning. The most direct reason is that only a small part of mathematics has been formalized so far, and a pragmatic approach to tap into much more training data is to find a way to learn from natural language mathematics (books and papers on mathematical topics). In this section, however, we want to look beyond the question of feasibility and training data, and discuss the broad advantages of a natural language approach to mathematics.

*Accessibility.* A bridge between natural and formal mathematics could help make such systems much more accessible, by not requiring users to learn a specific formal language. This might open up mathematics to a much wider audience, enabling advanced mathematical assistants (think WolframAlpha [50]) and tools for education.

Vice versa, an advanced automatic mathematician without the ability to explain its reasoning in natural language might be hard to understand. Even if the system's predictions and theories are correct, sophisticated, and relevant, we might not be able to use them to inform our own understanding if the notions the system comes up with are only available as vast synthetic formal objects.

*Conjecturing, Theory Exploration, and Interestingness.* Various approaches have been suggested to produce new conjectures, including heuristic filters [40], deriving rules from data [7], and learning and sampling from a distribution of theorems using language modeling [41].

A particularly interesting idea is the use of adversarial training to generate conjectures (e.g., [12]). Here, two neural networks compete against each other: one aims to prove statements, while the other aims to suggest hard-to-prove statements, somewhat akin to generative adversarial networks [15]. The idea is that the competition between the two networks generates a curriculum of harder and harder problems to solve and also automatically explores new parts of mathematics (as old parts get easier over time). However, there seems to be a catch: once the network that suggests problems has figured out how to define a one-way function, it becomes very easy to produce an unlimited number of hard problems, such as finding an input to the SHA-256 function that produces a certain output hash. This class of problems is almost impossible to solve, and thus likely leads the process into a dead end.

Once again, natural language seems to be a possible answer. Using the large body of natural language mathematics could help to equip machine learning models with a notion of what human mathematicians find *interesting*, and focus on these areas.

*Grounding Language Models.* Autoformalization does not only produce formal objects as a desired outcome; it also serves the dual purpose of improving language models. Checking the models' outputs and feeding back their correctness as a training signal would provide valuable grounding for their understanding.

Of course, the gap between formalized and informal mathematics is huge: it will likely require considerable effort to automatically create high-quality formalizations. Also, we believe that we will likely need a very high-quality theorem prover to bootstrap any autoformalization system. However, recent progress in neural language processing [9,42], unsupervised translation [30,43], and neural network based symbolic mathematics [29,41,18,39] makes this path seem increasingly feasible and appealing in the long run.

# **3 Conclusion**

In this extended abstract, we surveyed recent results in neural theorem proving and our mission to build an artificial mathematician, as well as some of the challenges on this path. While there is no guarantee that we can overcome these challenges, and there might be challenges we cannot even anticipate yet, even partial success in our mission could provide the formal methods community with tools that simplify the formalization process, and could impact adjacent areas such as verification, program synthesis, and natural language understanding.

In a 2018 survey among AI researchers, the median prediction for when machines "routinely and autonomously prove mathematical theorems that are publishable in top mathematics journals today, including generating the theorems to prove" was in the 2060s [16]. However, over the last few years, deep learning has already beaten many expectations (at least ours) as to what is possible in automated reasoning. There are still several challenges to be solved, some of which we laid out in this abstract, but we believe that creating a truly intelligent artificial mathematician is within reach and will happen on a much shorter time frame than many experts expect.

# **References**


S., Paulin-Mohring, C., Pichardie, D. (eds.) Interactive Theorem Proving - 4th International Conference, ITP 2013, Rennes, France, July 22-26, 2013. Proceedings. Lecture Notes in Computer Science, vol. 7998, pp. 163–179. Springer (2013). https://doi.org/10.1007/978-3-642-39634-2_14


vol. 10383, pp. 292–302. Springer (2017). https://doi.org/10.1007/978-3-319-62075-6_20


7-12, 2017. EPiC Series in Computing, vol. 46, pp. 85–105. EasyChair (2017), https://easychair.org/publications/paper/ND13



# **Logical Foundations**

#### Tableau-based Decision Procedure for Non-Fregean Logic of Sentential Identity<sup>⋆</sup>

Joanna Golińska-Pilarek<sup>1</sup>, Taneli Huuskonen<sup>1</sup>, and Michał Zawidzki<sup>2,3</sup>

<sup>1</sup> Faculty of Philosophy, University of Warsaw, 3 Krakowskie Przedmieście St., 00-927 Warsaw, Poland

<sup>2</sup> Department of Computer Science, University of Oxford, Oxford OX1 3QD, UK

<sup>3</sup> Department of Logic, University of Lodz, 3/5 Lindleya St., 90-131 Łódź, Poland

> j.golinska@uw.edu.pl taneli@poczta.onet.pl michal.zawidzki@cs.ox.ac.uk

Abstract. Sentential Calculus with Identity (SCI) is an extension of classical propositional logic, featuring a new connective of identity between formulas. In SCI, two formulas are said to be identical if they share the same denotation. In the semantics of the logic, truth values are distinguished from denotations, hence the identity connective is strictly stronger than classical equivalence. In this paper we present a sound, complete, and terminating algorithm deciding the satisfiability of SCI-formulas, based on labelled tableaux. To the best of our knowledge, it is the first implemented decision procedure for SCI which runs in NP, i.e., is complexity-optimal. The obtained complexity bound is a result of dividing the derivation rules of the algorithm into two sets, *decomposition* and *equality* rules, whose interplay yields derivation trees with branches of polynomial length with respect to the size of the investigated formula. We describe an implementation of the procedure and compare its performance with implementations of other calculi for SCI (for which, however, termination results were not established). We show possible refinements of our algorithm and discuss the possibility of extending it to other non-Fregean logics.

Keywords: Sentential Calculus with Identity · non-Fregean logics · labelled tableaux · decision procedure · termination · computational complexity.

# 1 Introduction

In this paper, we present a decision procedure for the non-Fregean sentential calculus with identity SCI. The contribution of the paper is twofold. First of all, this is the first implemented and complexity-optimal decision procedure for

<sup>⋆</sup> Research reported in this paper is supported by the National Science Centre, Poland (grant number: UMO-2017/25/B/HS1/00503).

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 41–57, 2021. https://doi.org/10.1007/978-3-030-79876-5_3

SCI, although several deduction systems for SCI have already been presented in the literature. Second, our decision procedure is constructed in the paradigm of labelled tableaux, which makes the whole approach more robust to modifications and extensions to other non-Fregean logics.

Non-Fregean logic is an alternative to both classical and many non-classical systems whose semantics identifies the semantical correlates of sentences with their logical values. According to the classical approach in model theory, semantical structures (realities) correspond to the language that is meant to describe them, and therefore symbols and expressions of that language, such as individual constants or relational symbols, have their denotations in these structures (respectively, objects or relations between objects). However, sentences are treated differently, as they are interpreted in models only in terms of logical values or other semantical relations such as satisfaction or truth. This classical approach allows us to answer the very basic logical question of whether two sentences are logically equivalent; however, it does not provide any tool that would allow us to check whether two sentences describe or refer to the same situation, or have the same meaning. Thus, the main motivation for non-Fregean logic was the need for an extensional and two-valued logic that could be used to represent semantical denotations of sentences which, depending on the underlying philosophical theory of language or the reality to which the logic is supposed to refer, could be understood as situations, states of affairs, meanings, etc. In order to express (non-)identities or other interactions between the referents of sentences, at least a universe of denotations of sentences needs to be added to the semantics, and a new *identity* connective to the language.

The minimal two-valued non-Fregean propositional logic SCI (*Sentential Calculus with Identity*), introduced by Suszko (see [21]), is an extension of classical propositional logic with a new binary connective of identity (≡) and axioms reflecting its fundamental properties. The identity connective represents the identity of the denotations of sentences, and so, an expression 'ϕ ≡ ψ' should be read as 'the sentences ϕ and ψ describe the same «thing»'. The semantics for SCI is based on structures determined by a universe of the denotations of sentences, a set of facts (those denotations that actually hold), and operations corresponding to all the connectives. The identity connective is then interpreted as an operation representing an equivalence relation that additionally satisfies the extensionality property. In the non-Fregean approach the identity and equivalence connectives are in general not equivalent: two sentences with the same truth value can have different denotations. Take, for instance, the following three statements:


A, B, C are all (necessarily) true as theorems of mathematical logic. Therefore, they are pairwise logically equivalent, that is, all three equivalences A ↔ B, B ↔ C, and A ↔ C hold. One can fairly claim that A and B refer to the same fact, so A ≡ B, but C clearly has a different semantic correlate than both A and B, as decidability is independent of Post consistency. Thus, we have A ≢ C and B ≢ C.

It is known that the class of all non-equivalent non-Fregean propositional logics satisfying the laws of classical logic is uncountable [7], and some of these logics are equivalent to well-known non-classical logics (e.g., the modal logics S4 and S5, many-valued logics). Higher-order non-Fregean logics are very expressive. In particular, a logic obtained from SCI by adding propositional quantifiers is undecidable and can express many mathematical theories, e.g., Peano arithmetic and the theories of groups, rings, and fields [8]. Furthermore, non-classical and deviant modifications of SCI have been developed and extensively studied in the literature, in particular intuitionistic logics [17,14,4], modal and epistemic logics [15,16], logics with non-classical identity [13], and paraconsistent logics [6,9]. The non-Fregean approach could turn out to be more adequate than the classical one in cognitive science or natural language processing. Moreover, non-Fregean logic could serve as a general framework for comparing different aspects of logics with incompatible languages and semantics, and help in addressing the question of which class of logics handles logical symbols in the most adequate way from the perspective of natural language.

In the original works by Suszko and Bloom the deduction system for SCI was defined in the Hilbert style [1,2]. Sound and complete deduction systems better suited for automated theorem proving were constructed later: Gentzen sequent calculi [18,22,23,3] and dual tableau systems [5,19,10]. A detailed presentation of all of them can be found in [10]. The main disadvantage of the aforementioned systems is that they are not decision procedures, while SCI is decidable, and in fact in NP [2, Theorem 2.3]. Although the system by Wasilewska [22] can be seen as a meta-tool for deciding validity of SCI-formulas, it is equipped with external meta-machinery that is not part of the system itself. As a result, it constitutes another proof of the decidability of SCI rather than a decision procedure in the classical sense of the term, that is, one suitable for computer implementation. In [11] a tableau-based algorithm for SCI was presented as work in progress. The decision procedure presented in this paper is the result of a substantial remodelling of the preliminary system introduced in [11], for which we prove soundness and completeness, present surprisingly straightforward proofs of termination and membership in NP, and provide an implementation.

In this paper, we present a new deduction system TSCI for the logic SCI, based on labelled tableaux. To the best of our knowledge, it yields the first implemented decision procedure for SCI. Moreover, its upper complexity bound, NP, matches the complexity of the satisfiability problem for SCI, making the algorithm complexity-optimal. The language of deduction extends the SCI-language with two sorts of labels representing the denotations of formulas (i.e., «facts» and «non-facts»), as well as with equality and inequality relations that can hold between labels. (In)equality formulas occurring in a derivation tree provide additional information on the identity or distinctness of the denotations of formulas. In Section 2, we give a formal overview of the logic SCI. In Section 3, we introduce the tableau algorithm TSCI, prove its soundness, completeness, and termination, establish that it is complexity-optimal with respect to SCI-satisfiability, and show a possible refinement thereof. In Section 4, we discuss an implementation of TSCI and compare it with an older prover based on a heuristic, unproven algorithm. Conclusions and directions of further research are presented in Section 5.

# 2 **SCI**

*Syntax* Let L<sub>SCI</sub> be the language of the logic SCI over the alphabet AF ∪ {¬, →, ≡}, where AF = {p, q, r, ...} is a denumerable set of *atomic formulas*. The set FOR of SCI*-formulas* is defined by the following abstract grammar:

ϕ ::= p | ¬ϕ | ϕ → ϕ | ϕ ≡ ϕ,

where p ∈ AF.

*Axiomatization* The logic SCI is axiomatized by the following set of truth-functional (1–3) and identity (4–8) axiom schemes:

1. ϕ → (ψ → ϕ)
2. (ϕ → (ψ → χ)) → ((ϕ → ψ) → (ϕ → χ))
3. (¬ϕ → ¬ψ) → (ψ → ϕ)
4. ϕ ≡ ϕ
5. ϕ ≡ ψ → ¬ϕ ≡ ¬ψ
6. ϕ ≡ ψ → (χ ≡ θ → (ϕ → χ) ≡ (ψ → θ))
7. ϕ ≡ ψ → (χ ≡ θ → (ϕ ≡ χ) ≡ (ψ ≡ θ))
8. ϕ ≡ ψ → (ϕ → ψ)

*Semantics* Let U ≠ ∅, D ⊂ U, and let ¬˜ : U −→ U, →˜ : U × U −→ U, and ≡˜ : U × U −→ U be functions on U. An SCI*-model* is a structure M = ⟨U, D, ¬˜, →˜, ≡˜⟩, where U and D are called, respectively, the *universe* and the *set of designated values*, and the following conditions are satisfied for all a, b ∈ U:

$$
\tilde{\neg}\, a \in D \qquad \text{iff} \qquad a \notin D \tag{1}
$$

$$
a \mathbin{\tilde{\to}} b \in D \qquad \text{iff} \qquad a \notin D \text{ or } b \in D \tag{2}
$$

$$
a \mathbin{\tilde{\equiv}} b \in D \qquad \text{iff} \qquad a = b. \tag{3}
$$

A *valuation* in an SCI-model M = ⟨U, D, ¬˜, →˜, ≡˜⟩ is a function V : FOR −→ U such that for all ϕ, ψ ∈ FOR it holds that V(¬ϕ) = ¬˜V(ϕ) and V(ϕ # ψ) = V(ϕ) #˜ V(ψ), for # ∈ {→, ≡}. The element a ∈ U such that a = V(ϕ) is called the *denotation of* ϕ. Interestingly, an SCI-model can alternatively be defined as a triple M = ⟨U, D, V⟩, where the valuation V : FOR −→ U needs to satisfy conditions analogous to (1)–(3) (for instance, V(¬ϕ) ∈ D iff V(ϕ) ∉ D, etc.). In the original approach, V may as well be defined only for atomic formulas and then lifted homomorphically to the set of all formulas, as in classical propositional logic. In the latter setting this is not possible, as a valuation defined solely for atoms usually does not have a unique extension to all formulas.

We say that a formula ϕ is *satisfied* in an SCI-model M = ⟨U, D, ¬˜, →˜, ≡˜⟩ by a valuation V in M, written M, V ⊨SCI ϕ, if its denotation belongs to D. We call a formula ϕ *satisfiable* if it is satisfied in some SCI-model by some valuation. We say that ϕ is *true* in a model M = ⟨U, D, ¬˜, →˜, ≡˜⟩, written M ⊨SCI ϕ, whenever it is satisfied in M by all valuations in M. We call ϕ *valid*, written ⊨SCI ϕ, if it is true in all SCI-models. Note that over the class of models in which D and U \ D are singletons, SCI collapses to classical propositional logic. In fact, all formulas which are SCI-instances of formulas valid in classical propositional logic are also valid in SCI. It suffices, however, to take a three-element model to tell ↔ and ≡ apart, as shown in the following example.

*Example 1.* Although the formula ¬¬p ↔ p is a tautology of classical propositional logic, the formula ¬¬p ≡ p is not valid in SCI. Indeed, consider an SCI-model M = ⟨U, D, ¬˜, →˜, ≡˜⟩, where U = {0, 1, 2}, D = {1, 2}, and the operations ¬˜, →˜, ≡˜ are defined by:

$$
\tilde{\neg}\, a = \begin{cases} 0, & \text{if } a \neq 0, \\ 1, & \text{otherwise,} \end{cases}
\qquad
a \mathbin{\tilde{\to}} b = \begin{cases} 0, & \text{if } a \neq 0 \text{ and } b = 0, \\ 2, & \text{if } a = b, \\ 1, & \text{otherwise,} \end{cases}
\qquad
a \mathbin{\tilde{\equiv}} b = \begin{cases} 0, & \text{if } a \neq b, \\ a, & \text{if } a = b \text{ and } a \neq 0, \\ 1, & \text{otherwise.} \end{cases}
$$

It is easy to verify that such a structure is an SCI-model, that is, that the operations satisfy conditions (1)–(3). Now take a valuation V with V(p) = 2. Then V(¬p) = ¬˜2 = 0 and V(¬¬p) = ¬˜0 = 1, hence V(¬¬p ≡ p) = 1 ≡˜ 2 = 0 ∉ D. Thus M, V ⊭SCI ¬¬p ≡ p, even though ¬¬p ↔ p, like every classical tautology, is satisfied.


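The model of Example 1 is small enough to check mechanically. The following sketch (our own illustration, not part of the paper's implementation) encodes the three operations and verifies both conditions (1)–(3) and the failure of ¬¬p ≡ p:

```python
# The three-element SCI-model of Example 1: U = {0, 1, 2}, D = {1, 2}.
U, D = {0, 1, 2}, {1, 2}

def neg(a):      return 0 if a != 0 else 1
def imp(a, b):   return 0 if (a != 0 and b == 0) else (2 if a == b else 1)
def ident(a, b): return 0 if a != b else (a if a != 0 else 1)

# Conditions (1)-(3) of the semantics hold for all elements of U:
assert all((neg(a) in D) == (a not in D) for a in U)
assert all((imp(a, b) in D) == (a not in D or b in D) for a in U for b in U)
assert all((ident(a, b) in D) == (a == b) for a in U for b in U)

# Under the valuation V(p) = 2, the formula ~~p == p is not satisfied:
p = 2
print(ident(neg(neg(p)), p) in D)  # prints False
```

Under V(p) = 2 we get V(¬¬p) = 1, and 1 ≡˜ 2 = 0 ∉ D, matching the example.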
What is also characteristic of SCI is that identical formulas can be interchanged within other formulas with not only truth preservation, but also identity preservation. For instance, if p ≡ (p → q), then p ≡ ((p → q) → q), p ≡ (((p → q) → q) → q), and so on. On the other hand, identity of two formulas does not automatically yield identity of their subformulas. For example, if ¬p ≡ ¬q, it does not necessarily mean that p ≡ q. It is worth noting that in SCI we lack the usual equivalence between treating ∧, ∨, and ↔ as abbreviations involving ¬ and → and treating them as independent connectives whose mutual relations are established axiomatically. For instance, when ¬(ϕ → ¬ψ) is just a notational variant of ϕ ∧ ψ, then (ϕ ∧ ψ) ≡ ¬(ϕ → ¬ψ) is, of course, SCI-valid; however, this would not be the case if we regarded ∧ as a separate connective. Nevertheless, extending our results to other connectives introduced as independent logical constants is a matter of routine.

# 3 Tableaux

In this section, we provide a characterization of a sound, complete and terminating *labelled* tableau system for the logic SCI, which we call TSCI.

Let L+ and L− be countably infinite disjoint sets and let L = L+ ∪ L−. We call an expression w : ϕ, where w ∈ L and ϕ ∈ FOR, a *labelled formula*, and w a *label*. We denote the set of all labelled formulas by LF. Labels superscribed with '+' are restricted to belong to L+ and labels superscribed with '−' to belong to L−. Labels without a superscript are not restricted. Intuitively, w stands for the denotation of ϕ in an intended model. Labels with '+' in the superscript denote elements of D, whereas labels with superscribed '−' represent elements of U \ D. Thus, expressions of the form w = v or w ≠ v reflect, respectively, the equality or distinctness of two denotations. By Id+ and Id− we denote the sets of, respectively, all equalities and all inequalities of labels. Finally, we let Id = Id+ ∪ Id−.

A *tableau* generated by the system for the logic SCI is a *derivation tree* whose nodes are assigned labelled formulas and (in)equality expressions. A simple path B from the root to a leaf in a tableau T is called a *branch of* T. We identify a branch B with the set of labelled formulas and (in)equalities occurring on B.

The rules of our tableau system have the general form Φ / Ψ1 | ... | Ψn, where Φ is the set of *premises* and each Ψi, for i ∈ {1, ..., n}, is a set of *conclusions*. Intuitively, the '|' symbol should be read as a meta-disjunction. A rule with only one set of conclusions is called a *non-branching* rule; a rule with several sets of conclusions is a *branching* rule. In TSCI all rules whose Ψi, for i ∈ {1, ..., n}, contain labelled formulas are called *decomposition rules*. All rules with a single equality statement as the conclusion are called *equality rules*. The remaining rules, in which ⊥ occurs as the conclusion, are referred to as *closure rules*. If (R) is a decomposition rule with w : ϕ as its premise, then (R) is *applicable* to w : ϕ occurring on a branch B if it has not been applied to w : ϕ on B before. Otherwise w : ϕ is called (R)*-expanded* on B. For an equality rule (R) with Φ as the set of premises and w = v as the conclusion, (R) is applicable to Φ ⊆ B if w = v is not present on B. Otherwise Φ is (R)-expanded on B. Intuitively, if a set of premises Φ is (R)-expanded on B, then applying (R) to Φ would not add any new information to B.

A branch B of a tableau T is extended by applying rules of the system to sets of labelled formulas and (in)equality statements that are already on B. A label w is *present* on B if there exists a formula ϕ such that w : ϕ occurs on B. Otherwise w is *fresh* on B. A branch B is called *closed* if one of the closure rules has been applied to it, that is, when an inconsistency occurs on B. A branch that is not closed is *open*. A branch B is *fully expanded* if it is closed or no rules are applicable on it. A tableau T is called *closed* if all of its branches are closed. Otherwise T is called *open*. We call T fully expanded if all of its branches are fully expanded.

Analytic tableaux are satisfiability checkers, so a *tableau proof* of a formula ϕ is a closed tableau with a labelled formula w− : ϕ at its root. A formula ϕ is *tableau-valid* if all tableaux with w− : ϕ at the root are closed. On the other hand, a

$$(\neg^{+})\quad \frac{w^{+} : \neg \varphi}{v^{-} : \varphi} \qquad (\neg^{-})\quad \frac{w^{-} : \neg \varphi}{v^{+} : \varphi}$$

$$
(\rightarrow^{+})\ \frac{w^{+} : \varphi \rightarrow \psi}{\begin{array}{c} v^{-} : \varphi \\ u^{-} : \psi \end{array} \ \Big|\ \begin{array}{c} v^{-} : \varphi \\ u^{+} : \psi \end{array} \ \Big|\ \begin{array}{c} v^{+} : \varphi \\ u^{+} : \psi \end{array}}
\qquad
(\rightarrow^{-})\ \frac{w^{-} : \varphi \rightarrow \psi}{\begin{array}{c} v^{+} : \varphi \\ u^{-} : \psi \end{array}}
$$

$$
(\equiv^{+})\ \frac{w^{+} : \varphi \equiv \psi}{\begin{array}{c} v^{+} : \varphi \\ u^{+} : \psi \\ v^{+} = u^{+} \end{array} \ \Big|\ \begin{array}{c} v^{-} : \varphi \\ u^{-} : \psi \\ v^{-} = u^{-} \end{array}}
\qquad
(\equiv^{-})\ \frac{w^{-} : \varphi \equiv \psi}{\begin{array}{c} v^{+} : \varphi \\ u^{+} : \psi \\ v^{+} \neq u^{+} \end{array} \ \Big|\ \begin{array}{c} v^{+} : \varphi \\ u^{-} : \psi \end{array} \ \Big|\ \begin{array}{c} v^{-} : \varphi \\ u^{+} : \psi \end{array} \ \Big|\ \begin{array}{c} v^{-} : \varphi \\ u^{-} : \psi \\ v^{-} \neq u^{-} \end{array}}
$$

$$
(\equiv^{\neg})\ \frac{\begin{array}{c} \varphi \approx \psi \\ x : \neg\varphi \\ y : \neg\psi \end{array}}{x = y}
\qquad
(\equiv^{\rightarrow})\ \frac{\begin{array}{c} \varphi \approx \psi \quad \chi \approx \theta \\ x : \varphi \rightarrow \chi \\ y : \psi \rightarrow \theta \end{array}}{x = y}
\qquad
(\equiv^{\equiv})\ \frac{\begin{array}{c} \varphi \approx \psi \quad \chi \approx \theta \\ x : \varphi \equiv \chi \\ y : \psi \equiv \theta \end{array}}{x = y}
\qquad
(\mathrm{F})\ \frac{\begin{array}{c} w : \varphi \\ v : \varphi \end{array}}{w = v}
$$

$$
(\mathrm{sym})\ \frac{w = v}{v = w}
\qquad
(\mathrm{tran})\ \frac{\begin{array}{c} w = v \\ v = u \end{array}}{w = u}
\qquad
(\perp_{1})\ \frac{\begin{array}{c} w = v \\ w \neq v \end{array}}{\bot}
\qquad
(\perp_{2})\ \frac{w^{+} = v^{-}}{\bot}
$$

<sup>1</sup> Labels occurring in conclusions of the rules: (¬<sup>+</sup>), (¬<sup>−</sup>), (→<sup>+</sup>), (→<sup>−</sup>), (≡<sup>+</sup>), (≡<sup>−</sup>) are fresh on the branch.

<sup>2</sup> The abbreviation <sup>ϕ</sup> <sup>≈</sup> <sup>ψ</sup> represents the set of three preconditions: <sup>w</sup> : <sup>ϕ</sup>, <sup>v</sup> : <sup>ψ</sup>, w = v, for some w, v ∈ L. Similarly for χ ≈ θ.

#### Fig. 1. Tableau system TSCI

formula ϕ is *tableau-satisfiable* if there exists an open and fully expanded tableau with a labelled formula w+ : ϕ at its root. Note that our notion of tableau-satisfiability matches the usual notion of satisfiability as a failure to find a proof. Indeed, if a formula ϕ is not tableau-valid, that is, there exists a tableau with w− : ϕ at the root which has an open branch, then ¬ϕ is tableau-satisfiable. Thus, the standard duality between validity and satisfiability is reflected in the concepts of tableau-validity and tableau-satisfiability.

#### 3.1 Tableau System for **SCI**

The rules presented in Figure 1 constitute the tableau system TSCI for the logic SCI. The decomposition rules (¬<sup>+</sup>), (¬<sup>−</sup>), (→<sup>+</sup>), (→<sup>−</sup>), (≡<sup>+</sup>), (≡<sup>−</sup>) reflect the semantics of ¬, → and ≡ defined in the conditions 1–3 from Section 2. Note that an application of any of these rules introduces to a branch fresh labels for each of the subformulas into which the premise formula is decomposed. By that means, all occurrences of subformulas of the input formula ϕ are assigned their unique labels. A few words of extra commentary on the rule (≡−) are in order. It decomposes a formula involving the ≡ connective, which is assumed to be false. By the semantics of ≡ we know that the constituents of the initial ≡-formula have distinct denotations. If these denotations have different polarities, representing different truth values (disjuncts 2 and 3 in the denominator of the rule), then no additional information has to be stored about the distinctness of these denotations. If, on the other hand, the denotations have the same polarity, representing the same truth value (disjuncts 1 and 4 in the denominator of the rule), then extra information is added, namely that the denotations of both formulas are distinct. The rules (≡<sup>¬</sup>), (≡<sup>→</sup>) and (≡<sup>≡</sup>) are tableau-counterparts of the axioms 5, 6, and 7, respectively. The rule (F) ensures that a valuation that can be read off from an open branch is a function, i.e., that all denotations assigned to the same formula on a branch are equal. The rules (sym) and (tran) guarantee that equalities appearing on a branch preserve all properties of the =-relation. Note that an application of a closure rule to a branch is always a result of transformations of equality statements. While executing TSCI we always apply closure rules eagerly, that is, whenever a closure rule can be applied, it should be applied. 
An example of a tableau proof generated by TSCI can be found in Figure 2.

The tableau system TSCI is a user-friendly and elegant solution to the problem most non-labelled systems for SCI struggle with, namely the substitutability of identical formulas within other formulas with identity preservation. In a derivation this can result in conclusions of greater complexity than the premises, as shown at the end of Section 2, which often leads to a loss of the subformula property in a deduction system. TSCI, on the other hand, reduces the whole reasoning to a simple equality calculus where only identities or non-identities between labels are substantial for the result of a given derivation. This allows us to circumvent the above-mentioned problem by replacing it with the question: are the labels representing given formulas equal or distinct?

The tableau starts with w− : ϕ ≡ ψ → (ϕ → ψ). Applying (→−) twice yields v+ : ϕ ≡ ψ, u− : ϕ → ψ, and then x+ : ϕ, y− : ψ. Applying (≡+) to v+ : ϕ ≡ ψ splits the branch:

– left branch: z+ : ϕ, t+ : ψ, z+ = t+; then (F) yields y− = t+, and (⊥2) closes the branch;
– right branch: z− : ϕ, t− : ψ, z− = t−; then (F) yields x+ = z−, and (⊥2) closes the branch.

Fig. 2. Tableau proof for the axiom ϕ ≡ ψ → (ϕ → ψ)
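
The equality calculus that closes branches can be sketched with a standard union-find structure: labels carry a polarity, (sym) and (tran) are handled implicitly by the equivalence classes, and the closure rules fire when an equality contradicts a recorded inequality (⊥1) or when a '+' label and a '−' label end up in the same class (⊥2). A minimal sketch (Python; class and method names are ours, not taken from the authors' Haskell implementation):

```python
class LabelStore:
    """Equivalence classes of labels, with polarities and disequalities."""

    def __init__(self):
        self.parent = {}   # union-find forest over labels
        self.neq = set()   # recorded inequalities, e.g. ('v-', 'u-')

    def find(self, w):
        self.parent.setdefault(w, w)
        while self.parent[w] != w:
            self.parent[w] = self.parent[self.parent[w]]  # path halving
            w = self.parent[w]
        return w

    @staticmethod
    def polarity(w):
        return w[-1]       # labels are strings ending in '+' or '-'

    def add_eq(self, w, v):
        self.parent[self.find(w)] = self.find(v)

    def add_neq(self, w, v):
        self.find(w); self.find(v)
        self.neq.add((w, v))

    def closed(self):
        """True iff a closure rule applies: (bot1) some recorded w != v now
        has find(w) == find(v); (bot2) some class mixes '+' and '-' labels."""
        if any(self.find(w) == self.find(v) for w, v in self.neq):
            return True
        classes = {}
        for w in self.parent:
            classes.setdefault(self.find(w), set()).add(self.polarity(w))
        return any(ps == {'+', '-'} for ps in classes.values())
```

For instance, on the left branch of Figure 2, adding z+ = t+ keeps the branch open, while adding y− = t+ merges a '−' label into a '+' class and triggers (⊥2).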

#### 3.2 Soundness and Completeness<sup>4</sup>

First, we will prove soundness of the tableau system TSCI.

<sup>4</sup> A technical appendix to the paper with all omitted proofs can be found in [12].

Let A, B be finite sets such that A ⊆ LF and B ⊆ Id. A set A ∪ B is said to be satisfied in an SCI-model M = ⟨U, D, ¬˜, →˜, ≡˜⟩ by a valuation V in M and a function f : L −→ U if and only if the following hold: (1) V(ϕ) = f(w), for all w ∈ L and ϕ ∈ FOR such that w : ϕ ∈ A, (2) f(w) ∈ D iff w ∈ L+, for all labels w that occur in A ∪ B, (3) f(w) = f(v), for all w, v ∈ L such that w = v ∈ B, (4) f(w) ≠ f(v), for all w, v ∈ L such that w ≠ v ∈ B. A set A ∪ B is said to be SCI-satisfiable whenever there exist an SCI-model M = ⟨U, D, ¬˜, →˜, ≡˜⟩, a valuation V in M, and a function f : L −→ U such that A ∪ B is satisfied in M by V and f.

Proposition 1. *For every satisfiable* SCI*-formula* ϕ *and for all* w+ ∈ L+ *it holds that* {w+ : ϕ} *is* SCI*-satisfiable.*

Proposition 2. *For all* w, v ∈ L*,* w+ ∈ L+*, and* v− ∈ L−*, and for all finite* X ⊆ LF ∪ Id*, the sets* X ∪ {w = v, w ≠ v} *and* X ∪ {w+ = v−} *are not* SCI*-satisfiable.*

Let (R) be a decomposition or equality rule of the tableau system TSCI of the form Φ / Ψ1 | ... | Ψn, for n ≥ 1. The rule (R) is referred to as *sound* whenever, for every finite set X ⊆ LF ∪ Id, it holds that X ∪ Φ is SCI-satisfiable iff X ∪ Φ ∪ Ψi is SCI-satisfiable for some i ∈ {1, ..., n}.

Proposition 3. *Decomposition and equality rules of the tableau system* TSCI *are sound.*

Theorem 1 (Soundness). *The tableau system* TSCI *is sound, that is, if an* SCI *formula* ϕ *is satisfiable, then* ϕ *is tableau-satisfiable.*

*Proof.* We prove the contrapositive. Let T be a closed TSCI-tableau with w+ : ϕ at its root. Then, each branch of T contains either w+ = v− or both w = v and w ≠ v, for some w, v ∈ L, w+ ∈ L+, v− ∈ L−. By Proposition 2, neither of the sets X ∪ {w+ = v−} and X ∪ {w = v, w ≠ v} is SCI-satisfiable, for any finite set X ⊆ LF ∪ Id. By Proposition 3, each application of TSCI-rules preserves SCI-satisfiability. Hence, going from the bottom to the top of the tree T, at each step of the construction of the TSCI-tableau we get SCI-unsatisfiable sets. Thus, we can conclude that {w+ : ϕ} is not SCI-satisfiable, and by Proposition 1 we obtain that ϕ is not satisfiable. Therefore, each satisfiable SCI-formula ϕ is tableau-satisfiable.

To prove completeness of the system TSCI we need to show that if, for a given formula ϕ, TSCI does not yield a tableau proof, then ϕ is not valid, i.e., there exists a countermodel M = ⟨U, D, V⟩ in which ϕ is not true.

Suppose that we want to obtain a tableau proof for a formula ϕ. To that end, we run the TSCI-tableau algorithm with a labelled formula **w**− : ϕ at the root of the tableau, for **w**− ∈ L−. Suppose that it yields an open tableau as a result. This means that the tableau contains an open and fully expanded branch B. We will demonstrate how to construct a structure M_B = ⟨U, D, ¬˜, →˜, ≡˜⟩ using the information stored on B and show that it is an SCI-countermodel falsifying ϕ. Let L_B^+ be the set of all labels superscribed with '+' occurring on B, let L_B^- be the set of all labels superscribed with '−' occurring on B, and let L_B = L_B^+ ∪ L_B^-. Moreover, let FOR_B be the set of all SCI-formulas ψ such that w : ψ occurs on B, for some w ∈ L_B. Note that all elements of FOR_B are subformulas of ϕ. Before we characterize the construction of M_B, we define a binary relation ∼ ⊆ L_B × L_B in the following way:

w ∼ v iff w = v occurs on B.

Proposition 4. *The relation* ∼ *is an equivalence relation and* (L_B^+ × L_B^-) ∩ ∼ = ∅*.*

Let ML_B^+ be a set resulting from choosing exactly one label from each element of (L_B^+)/∼. The sets ML_B^- and ML_B are defined analogously, with the assumption that **w**− ∈ ML_B^-, where **w**− is such that **w**− : ϕ is at the root of the open tableau. Of course, none of these sets is uniquely determined.

Proposition 5. *For all* ψ ∈ FOR *and* w, v ∈ L<sup>B</sup> *the following holds:*

*if both* w : ψ *and* v : ψ *belong to* B*, then* w ∼ v*.*

We say that w ∈ ML_B is (¬)*-closed* whenever there are ψ ∈ FOR, u ∈ ML_B, and v, t ∈ L_B such that w ∼ v, u ∼ t, and the labelled formulas v : ψ, t : ¬ψ belong to B. Let w, v ∈ ML_B and # ∈ {→, ≡}. The pair (w, v) is said to be (#)*-closed* whenever there exist ψ, θ ∈ FOR, u ∈ ML_B, and t, x, y ∈ L_B such that w ∼ t, v ∼ x, u ∼ y, and the labelled formulas t : ψ, x : θ, y : (ψ#θ) occur on the branch B.

The *branch structure* M_B = ⟨U, D, ¬˜, →˜, ≡˜⟩ is defined as follows:

– D = {w+ | w+ ∈ ML_B^+} ∪ {**w**+}, where **w**+ ∉ L_B,
– U = D ∪ ML_B^-.

It follows from the above that U \ D = ML_B^-. The operations ¬˜, →˜, ≡˜ are defined for all w, v ∈ U in the following way:

$$
\tilde{\neg} w \stackrel{\mathrm{df}}{=} \begin{cases} u, & \text{if there are } \psi \in \mathrm{FOR},\ u \in ML_B, \text{ and } v, t \in L_B \text{ such that} \\ & w \sim v,\ u \sim t, \text{ and } v : \psi,\ t : \neg\psi \text{ are on } B, \\ \mathbf{w}^{+}, & \text{if } w \text{ is not } (\neg)\text{-closed and } w \notin D, \\ \mathbf{w}^{-}, & \text{otherwise;} \end{cases}
$$

$$
w \mathbin{\tilde{\rightarrow}} v \stackrel{\mathrm{df}}{=} \begin{cases} u, & \text{if there are } \psi, \theta \in \mathrm{FOR},\ u \in ML_B, \text{ and } t, x, y \in L_B \text{ such that} \\ & w \sim t,\ v \sim x,\ u \sim y, \text{ and } t : \psi,\ x : \theta,\ y : (\psi \rightarrow \theta) \text{ are on } B, \\ \mathbf{w}^{+}, & \text{if } v = \mathbf{w}^{+}, \text{ or both } w = \mathbf{w}^{+} \text{ and } v \in D, \text{ or it holds that} \\ & (w, v) \text{ is not } (\rightarrow)\text{-closed and either } w \notin D \text{ or } v \in D, \\ \mathbf{w}^{-}, & \text{otherwise;} \end{cases}
$$

$$
w \mathbin{\tilde{\equiv}} v \stackrel{\mathrm{df}}{=} \begin{cases} u, & \text{if there are } \psi, \theta \in \mathrm{FOR},\ u \in ML_B, \text{ and } t, x, y \in L_B \text{ such that} \\ & w \sim t,\ v \sim x,\ u \sim y, \text{ and } t : \psi,\ x : \theta,\ y : (\psi \equiv \theta) \text{ are on } B, \\ \mathbf{w}^{+}, & \text{if } w = v \text{ and either } w = \mathbf{w}^{+} \text{ or the pair } (w, v) \text{ is not } (\equiv)\text{-closed,} \\ \mathbf{w}^{-}, & \text{otherwise.} \end{cases}
$$

Due to the properties of the sets ML_B^+ and ML_B^-, we obtain:

Proposition 6. *The sets* D *and* U \ D *are non-empty and* D ∩ (U \ D) = ∅*.*

The following series of results ensure that the operations ¬˜, →˜ , and ≡˜ reflect the semantics of SCI.

Proposition 7. ¬˜ *is a function on* U *and for all* w ∈ U*:*

(∗) ¬˜w ∈ D *iff* w ∉ D*.*

Proposition 8. →˜ *is a function on* U *and for all* w, v ∈ U*, the following holds:*

(∗) w →˜ v ∈ D *iff* w ∉ D *or* v ∈ D*.*

Proposition 9. ≡˜ *is a function on* U *and for all* w, v ∈ U *the following holds:*

(∗) w≡˜ v ∈ D *iff* w = v*.*

Propositions 6–9 imply:

Proposition 10. *The structure* M<sup>B</sup> *is an* SCI*-model.*

In what follows, the structure M<sup>B</sup> will be referred to as *branch model*.

Now, let V : FOR −→ U be a function such that for all p ∈ AF:

$$
V(p) = \begin{cases} u \in ML_B, & \text{if there is } w \in L_B \text{ such that } w : p \in B \text{ and } w \sim u, \\ \mathbf{w}^{+}, & \text{otherwise,} \end{cases}
$$

and for all ψ, θ ∈ FOR the following hold:

V(¬ψ) = ¬˜V(ψ) and V(ψ#θ) = V(ψ) #˜ V(θ), for # ∈ {→, ≡}.

Proposition 11. *The function* V *is well defined and it is a valuation in* MB*.*

Proposition 12. *For all* ψ ∈ FOR *and* w ∈ L<sup>B</sup> *it holds that:*

(∗) *If* w : ψ ∈ B*, then* w ∼ V (ψ)*.*

Theorem 2 (Completeness). *The tableau system* TSCI *is complete, that is, if a formula* ϕ *is* SCI*-valid, then* ϕ *has a tableau proof.*

*Proof.* Let ϕ be a valid SCI-formula. Suppose that ϕ does not have a tableau proof. Then, each TSCI-tableau with **w**− : ϕ at its root is open. Let B be an open and fully expanded branch of an open tableau for **w**− : ϕ. By Proposition 10, the structure M_B = ⟨U, D, ¬˜, →˜, ≡˜⟩ is an SCI-model. Let V be the valuation in M_B defined just before Proposition 11. Then, by Proposition 12, **w**− ∼ V(ϕ), and hence V(ϕ) ∉ D. Thus, ϕ is not true in M_B, which contradicts the assumption that ϕ is SCI-valid.

#### 3.3 Termination

It turns out that the system presented in Section 3.1 terminates without any external blocking mechanisms imposing additional restrictions on rule application. The only caveat that has to be added to the system is the one we have already expressed, namely that no rule (R) can be applied to a set of premises that is (R)-expanded.

Theorem 3. *The tableau system* TSCI *is terminating.*

*Proof.* The argument hinges on two observations. First, the decomposition rules are the only rules that introduce fresh labels to a branch B of a TSCI-tableau T, and, as mentioned before, on a branch B each occurrence of a subformula of the initial formula ϕ is assigned its unique label. Thus, since an application of any of these rules decreases the complexity of the processed formula and a rule cannot be applied twice to the same premise, the total number of labels occurring on a branch does not exceed the size of ϕ measured as the number of all occurrences of subformulas of ϕ (henceforth denoted by |ϕ|). Second, the equality rules can only add equalities between labels to a branch, provided that such an equality statement is not already present thereon. The maximal number of such equalities is quadratic in the total number of labels occurring on a branch. Thus, for each SCI-formula ϕ, on any branch B of a TSCI-tableau for ϕ, rules are applied at most |ϕ| + |ϕ|² + 1 times, where '1' represents an application of a closure rule. This makes the whole derivation finite.
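
The bound from the proof is easy to compute concretely. For the axiom (ϕ ≡ ψ) → (ϕ → ψ) with atomic ϕ and ψ there are 7 occurrences of subformulas, so rules are applied at most 7 + 7² + 1 = 57 times on a branch. A small sketch (Python; the tuple encoding of formulas is ours, for illustration):

```python
def size(f):
    """Number of occurrences of subformulas = number of parse-tree nodes."""
    if isinstance(f, str):              # an atomic formula
        return 1
    op, *args = f                       # ('->', left, right) etc.
    return 1 + sum(size(a) for a in args)

def rule_application_bound(f):
    n = size(f)
    return n + n * n + 1                # |phi| + |phi|^2 + 1

# (phi == psi) -> (phi -> psi), with phi and psi atomic
axiom = ('->', ('==', 'phi', 'psi'), ('->', 'phi', 'psi'))
print(size(axiom), rule_application_bound(axiom))   # prints: 7 57
```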

Corollary 1. *For each* SCI*-formula* ϕ *every branch* B *of a* TSCI*-tableau derivation for* ϕ *is of polynomial size with respect to the size of* ϕ*.*

Since SCI contains classical propositional logic, it inherits the NP lower bound for the satisfiability problem. Together with the membership of SCI-satisfiability in NP, this gives the following:

Theorem 4. TSCI *is a complexity-optimal decision procedure for the* NP*-complete problem of* SCI*-satisfiability.*

*Proof.* Immediate from Corollary 1 and the fact that each branching rule of TSCI is finitely branching.

## 3.4 Limiting the Number of Labels

To boost the performance of the system TSCI we propose a refinement. It consists in limiting the number of fresh labels introduced to a tableau by decomposition rules, by means of an additional condition called *urfather blocking*.

Given a formula ϕ for which we construct a TSCI-tableau T and a subformula ψ of ϕ, let us call the first occurrence of a labelled formula w : ψ on a branch B of T the ψ*-urfather on* B. The system TSCI + (UB) (*tableau system for* SCI *with urfather blocking*) is composed of the rules of TSCI and an additional constraint:

(UB) For each labelled formula w : ϕ that occurs on a branch B, no decomposition rule can be applied to w : ϕ unless it is the ϕ-urfather on B.
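
Operationally, (UB) only requires remembering, per branch, the first label attached to each formula and decomposing only those occurrences. A minimal sketch (Python; the names are ours, not from the authors' implementation):

```python
class Branch:
    """Tracks urfathers: the first labelled occurrence of each formula."""

    def __init__(self):
        self.urfather = {}          # formula -> label of its first occurrence

    def add(self, label, formula):
        # Only the first occurrence of a formula becomes its urfather.
        self.urfather.setdefault(formula, label)

    def may_decompose(self, label, formula):
        """(UB): a decomposition rule applies to w : psi only if
        w : psi is the psi-urfather on this branch."""
        return self.urfather.get(formula) == label
```

For example, if w− : p → q appears before v+ : p → q on a branch, only the former may be decomposed.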

It turns out that augmenting TSCI with (UB) does not lead to any unwanted consequences, such as a loss of completeness.

Proposition 13. *For every* SCI*-formula* ϕ*, if* ϕ *has a* TSCI*-tableau proof, then* ϕ *has a* TSCI + (UB)*-tableau proof.*

Theorem 5. TSCI + (UB) *is sound, complete, terminating, and complexity-optimal for* SCI*-satisfiability.*

*Proof.* The soundness of TSCI + (UB) follows straightforwardly from the soundness of TSCI and the fact that both systems share the full set of rules. The argument for termination of TSCI + (UB) and complexity-optimality of TSCI + (UB) for SCI-satisfiability goes along the same lines as the proofs of Theorems 3 and 4, and rests on the fact that, for each formula ϕ, a TSCI + (UB)-tableau contains at most as many labels as a TSCI-tableau. The completeness of TSCI + (UB) is a direct consequence of Proposition 13 and Theorem 2.

#### 4 Implementation

#### 4.1 Overview

We have written proof-of-concept implementations of the labelled tableau system described in the present article and of its variant with urfather blocking, as well as a dual-tableau-based theorem prover for SCI based on the system from [5]. Since the last system does not enjoy the termination property, its implementation relies on heuristics in this respect. All three provers are implemented in the Haskell language using similar programming techniques in a casual manner, without any serious attempt to optimize the code or to test it extensively, as the programs are only intended as temporary aids to ongoing research.

In testing, the labelled-tableau provers turned out to need drastically more computing resources even in many quite modest test cases. For instance, the axiom ((p ≡ q) ∧ (r ≡ s)) → ((p ≡ r) ≡ (q ≡ s)) generates a labelled tableau of depth 37 consisting of 619 nodes, which urfather blocking reduces to depth 33 and 555 nodes, while the tree of the dual-tableau prover has depth 18 and only 67 nodes. The difference appears to be mostly due to the large branching factor of the identity rules of the labelled-tableau system. However, in some test cases the labelled-tableau system yields a smaller tree than the other prover. In general, the labelled tableau method seems to tolerate relatively well formulas consisting of a large number of very simple identities.

#### 4.2 Technical Notes

Unlike the abstract tree described above, each node of which contains only a single labelled formula, each node of the tree built by the program contains a list of all the labelled formulas encountered so far on the branch. This allows the program to freely manipulate the list to keep track of what rules have already been applied to which formulas. There are three main types of nodes: normal nodes, identity nodes, and leaves. First, the decomposition rules are applied in normal nodes. Once they have been applied to exhaustion, the tree is extended with identity nodes, in which the identity rules are applied. At any point, one of the closure rules (⊥1) or (⊥2) can be applied to append a special closure leaf node. An open leaf node is appended whenever there are no more rules to apply in an identity node and the branch remains open.

#### 4.3 Test Results

We found a randomly generated provable SCI-formula that turned out to be somewhat challenging for an earlier prover. The formula, which we will call ϕ here, looks as follows:

$$
\begin{array}{c}
(((q \equiv p) \rightarrow (p \rightarrow r)) \equiv ((p \rightarrow (p \leftrightarrow p)) \equiv p)) \\
\rightarrow\ (((r \wedge p) \leftrightarrow (p \equiv p)) \vee ((p \wedge p) \vee \neg q))
\end{array}
$$

We denote by ψ the formula obtained by replacing each occurrence of p in ϕ by ϕ itself. We defined a provability-preserving transformation T that turns an SCI-formula into a Horn clause consisting of very simple identities.

We present the results of attempting to prove the formulas ϕ, ¬ϕ, ψ, ¬ψ, T(ϕ), and T(¬ϕ). These are chosen to illustrate some of the variety of outcomes we observed. As noted above, ϕ is provable, and therefore also ψ and T(ϕ) are provable. The results are of the form *depth/size,* where *depth* is the maximal branch length and *size* is the number of nodes in the entire tree. There are entries for the dual-tableau-based prover (DTSCI), the current labelled-tableau prover (TSCI), and the same with the urfather blocking condition (TSCI + (UB)). Several entries are missing due to exhaustion of memory (the programs were tested on a machine with 8GB of RAM; adding several gigabytes of swap space did not make a difference).


# 5 Conclusions

In this paper we introduced the system TSCI, which is the first complexity-optimal decision procedure for the logic SCI devised in the paradigm of labelled tableaux. TSCI is conceptually simple and directly reflects the semantics of the logic. The reasoning performed in TSCI has two components: decomposition and equality reasoning. Interestingly, it is the latter that is responsible for closing tableau branches and thus yielding tableau proofs for formulas. In this respect TSCI rests on similar conceptual foundations as the calculi generated by the tableau-synthesis framework from [20]. We provided an implementation of TSCI and of a variant with urfather blocking, and we compared their performance with the performance of another implemented deduction system for SCI which has not been proven to be terminating or complete. There was no unique winner; the new system was better at dealing with formulas with complex networks of identities, while the old, unproven system handled other types of formulas better. Urfather blocking yielded modest reductions in depth and total size.

In future research we want to address three main problems. First, we would like to optimize our tableau algorithm by introducing further refinements to it, such as decreasing the branching factor of the rule (→+) and, by that means, making it "information-deleting". Some preliminary results on the implementation of TSCI with the modified rule (→+) show a promising reduction of the size of generated tableaux. Moreover, we plan to search for heuristics and rule-application strategies which would likewise allow us to minimize the size of tableaux yielded by TSCI for certain classes of formulas. It seems that it is not always necessary to fully decompose the input formula before performing any equality reasoning, if a contradiction is to be reached on a branch. Secondly, we would like to develop the dual-tableau systems from [5] and [10] into full-fledged decision procedures, implement them, and compare the performance of all three algorithms on an extensive set of various SCI-formulas. Thirdly, we intend to extend the labelled tableaux-based approach presented in this paper to other non-Fregean logics, both classical (such as modal non-Fregean logics) and deviant (such as intuitionistic or many-valued non-Fregean logics, or Grzegorczyk's logic). Finally, we would like to take a closer look at various normal forms of SCI-formulas, one of which was mentioned in Section 4, and decide in what cases it pays off to transform a formula into a normal form before running a decision procedure, rather than running it directly on the initial formula.

#### References


23. Wasilewska, A.: DFC-algorithms for Suszko logic and one-to-one Gentzen type formalizations. Studia Logica 43(4), 395–404 (1984). https://doi.org/10.1007/BF00370509


# Learning from Łukasiewicz and Meredith: Investigations into Proof Structures

Christoph Wernhard<sup>1</sup> and Wolfgang Bibel<sup>2</sup>

<sup>1</sup> Berlin, Germany info@christophwernhard.com <sup>2</sup> Technical University Darmstadt, Darmstadt, Germany bibel@gmx.net

Abstract. The material presented in this paper contributes to establishing a basis deemed essential for substantial progress in Automated Deduction. It identifies and studies global features in selected problems and their proofs which offer the potential of guiding proof search in a more direct way. The studied problems are of the widespread form of "axiom(s) and rule(s) imply goal(s)". The features include the well-known concept of lemmas. For their elaboration, both human and automated proofs of selected theorems are taken into close comparative consideration. The study at the same time provides a coherent and comprehensive formal reconstruction of historical work by Łukasiewicz, Meredith and others. First experiments resulting from the study indicate novel ways of lemma generation to supplement automated first-order provers of various families, strengthening in particular their ability to find short proofs.

## 1 Introduction

Research in Automated Deduction, also known as Automated Theorem Proving (ATP), has resulted in systems with remarkable performance. Yet deep mathematical theorems and otherwise complex statements still withstand all attempts by these systems to find a proof. The present paper is motivated by the thesis that the reason for this failure on more complex problems lies in the locally oriented nature of all our current proof-search methods, such as the resolution or connection calculi in use.

In order to identify more global features for directing proof search, we begin here to study the structures of proofs for complex formulas in some detail and to compare human proofs with those generated by systems. Complex formulas of this kind were considered by Łukasiewicz in [19]. They are complex in the sense that current systems require tens of thousands or even millions of search steps to find a proof, if they find one at all, although the formulas themselves are very short. How could Łukasiewicz find proofs for these formulas although he could never carry out more than, say, a few hundred search steps by hand? Which global strategies guided him in finding those proofs? Could we discover such strategies from the formulas' global features?

By studying the proofs in detail we hope to come closer to answers to these questions. Thus it is proofs, rather than just formulas or clauses as is usual in

© The Author(s) 2021. A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 58–75, 2021. https://doi.org/10.1007/978-3-030-79876-5_4

ATP, which are the focus of our study. In a sense we are aiming at an ATP-oriented part of Proof Theory, a discipline usually pursued in Logic, yet under quite different aspects. This meta-level perspective has rarely been taken in ATP, for which reason we cannot rely on the existing conceptual basis of ATP but have to build an extensive conceptual basis for such a study more or less from scratch.

This investigation thus analyzes structures of, and operations on, proofs for formulas of the form "axiom(s) and rule(s) imply goal(s)". It renders condensed detachment, a logical rule historically introduced in the course of studying such proofs, as a restricted form of the Connection Method (CM) in ATP. All of this is pursued with the goal of enhancing proof search in ATP. As noted, our investigations are guided by a close inspection of proofs by Łukasiewicz and Meredith. In fact, the work presented here amounts at the same time to a very detailed reconstruction of those historical proofs.

The rest of the paper is organized as follows. In Sect. 2 we introduce the problem and a formal human proof that guides our investigations, and compare different views on proof structures. In Sect. 3 we then reconstruct the historical method of condensed detachment in a novel way, as a restricted variation of the CM in which proof structures are represented as terms. This is followed in Sect. 4 by results on reducing the size of such proof terms, with applications in proof shortening and in restricting the proof search space. Section 5 presents a detailed feature table for the investigated human proof, and Sect. 6 shows first experiments in which the features and new techniques are used to supplement the inputs of ATP systems with lemmas. Section 7 concludes the paper. Supplementary technical material including proofs is provided in the report [37]. Data and tools to reproduce the experiments are available at http://cs.christophwernhard.com/cd.

## 2 Relating Formal Human Proofs with ATP Proofs

In 1948 Jan Łukasiewicz published a formal proof of the completeness of his shortest single axiom for the implicational fragment (IF), that is, classical propositional logic with implication as the only logical operator [19]. In his notation the implication p → q is written as Cpq. Following Frank Pfenning [27], we formalize IF on the meta-level in the first-order setting of modern ATP, with a single unary predicate P to be interpreted as something like "provable", and represent the propositional formulas by terms using the binary function symbol i for implication. We will be concerned with the following formulas, written with the shorthand ipq for i(p, q):

$$\textit{Łukasiewicz} \stackrel{\text{def}}{=} \mathrm{P}i(i(ipq,r),i(irp,isp)) \qquad \textit{Simp} \stackrel{\text{def}}{=} \mathrm{P}i(p,iqp)$$

$$\textit{Peirce} \stackrel{\text{def}}{=} \mathrm{P}i(i(ipq,p),p) \qquad \textit{Syll} \stackrel{\text{def}}{=} \mathrm{P}i(ipq,i(iqr,ipr))$$


$$\mathrm{P}i(i(ipq,r),i(irp,isp)) \;\land\; (\mathrm{P}x \land \mathrm{P}ixy \to \mathrm{P}y) \;\to\; \mathrm{P}i(ipq,i(iqr,ipr))$$

Fig. 1. *ŁDS* along with its five unifiable connections.

IF can be axiomatized by the set of the three axioms Simp, Peirce and Syll, known as the Tarski–Bernays axioms. Alfred Tarski in 1925 raised the problem of characterizing IF by a single axiom and solved it with very long axioms, which led to a search for the shortest single axiom; it was found in 1936 by Łukasiewicz and is nicknamed Łukasiewicz after him [19]. In 1948 he published his derivation showing that Łukasiewicz entails the three Tarski–Bernays axioms, expressed formally by the method of substitution and detachment. Detachment is also familiar as modus ponens. Łukasiewicz's proof involves 34 applications of detachment. Among the Tarski–Bernays axioms, Syll is by far the most challenging to prove; hence his proof centers around the proof of Syll, with Peirce and Simp spinning off as side results. Carew A. Meredith presented in [24] a "very slight abridgement" of Łukasiewicz's proof, expressed in his framework of condensed detachment [28], where the performed substitutions are no longer explicitly presented but are implicitly assumed through unification. Meredith's proof involves only 33 applications of detachment. In our first-order setting, detachment can be modeled with the following meta-level axiom.

$$\textit{Det} \stackrel{\text{def}}{=} \forall x\,y\,(\mathrm{P}x \land \mathrm{P}ixy \to \mathrm{P}y).$$

In Det the atom Px is called the minor premise, Pixy the major premise, and Py the conclusion. Let us now focus on the following particular formula.

$$\textit{ŁDS} \stackrel{\text{def}}{=} \textit{Łukasiewicz} \land \textit{Det} \to \textit{Syll}.$$

"Problem ŁDS" is then the problem of determining the validity of the first-order formula ŁDS. In view of the CM [1,2,3], a formula is valid if there is a spanning and complementary set of connections in it. In Fig. 1, ŁDS is presented again, with nicknames dereferenced and quantifiers omitted as usual in ATP, together with the five unifiable connections in it. Observe that p, q, r, s on the left side of the main implication are variables, while p, q, r on the right side are Skolem constants. Any CM proof of ŁDS consists of a number of instances of the five shown connections. Meredith's proof, for example, corresponds to 491 instances of Det, each linked through instances of three of its five incident connections.

Figure 2 compares different representations of a short formal proof with the Det meta axiom. There is a single axiom, Pi(i(ipq, r), iqr), and the theorem is ∀pqrstu Pi(p, i(q, i(r, i(s, i(t, ius))))). Figure 2a shows the structure of a CM proof. It involves seven instances of Det, shown in columns D1,...,D7. The major premise Pix<sub>i</sub>y<sub>i</sub> is displayed there on top of the minor premise Px<sub>i</sub> and the (negated) conclusion ¬Py<sub>i</sub>, where x<sub>i</sub>, y<sub>i</sub> are variables. Instances of the axiom appear as literals ¬Pa<sub>i</sub>, with a<sub>i</sub> a shorthand for the term i(i(ip<sub>i</sub>q<sub>i</sub>, r<sub>i</sub>), iq<sub>i</sub>r<sub>i</sub>). The rightmost literal Pg is a shorthand for the Skolemized theorem. The clause instances are linked through edges representing connection instances. The edge labels identify the respective connections as in Fig. 1. An actual connection proof is obtained by supplementing this structure with a substitution under which all pairs of literals related through a connection instance become complementary.

Fig. 2. A proof in different representations.

Figure 2b represents the tree implicit in the CM proof. Its inner nodes correspond to the instances of Det, and its leaf nodes to the instances of the axiom. Edges are ordered such that those originating in a major premise of Det are directed to the left and those from a minor premise to the right. The goal clause Pg is dropped. The resulting tree is a full binary tree, i.e., a binary tree in which each node has 0 or 2 children. Observe that the ordering of the children makes the connection labeling redundant, as it directly corresponds to the tree structure.

Figure 2c presents the proof in Meredith's notation. Each line shows a formula, line 1 the axiom and lines 2–4 derived formulas, with proofs annotated in the last column. Proofs are written as terms in Polish notation with the binary function symbol D for detachment, where the subproofs of the major and minor premise are supplied as first and second argument, respectively. Formula 4, for example, is obtained as the conclusion of Det applied to formula 2 as major premise and, as minor premise, another formula that is not made explicit in the presentation, namely the conclusion of Det applied to formula 3 as both major and minor premise. An asterisk marks the goal theorem.
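Proof terms in this Polish notation can be parsed mechanically. The following is a minimal sketch; the nested-tuple representation, the function name, and the exact delimiting convention for dotted multi-digit identifiers are our own assumptions, not specified in the paper:

```python
def parse_dterm(s):
    """Parse a proof term in Polish notation, e.g. "D2D33" from Fig. 2c.

    Single digits and the symbol n are identifiers; following Fig. 3, a
    multi-digit identifier is assumed to be delimited by dots, e.g. ".17."
    """
    pos = [0]

    def term():
        c = s[pos[0]]
        if c == 'D':                 # detachment: two subproofs follow
            pos[0] += 1
            major = term()
            minor = term()
            return ('D', major, minor)
        if c == '.':                 # dotted multi-digit identifier
            pos[0] += 1
            j = pos[0]
            while j < len(s) and s[j].isdigit():
                j += 1
            ident = s[pos[0]:j]
            pos[0] = j + 1           # skip the closing dot
            return ident
        pos[0] += 1                  # single-character identifier (digit or n)
        return c

    d = term()
    assert pos[0] == len(s), "trailing input"
    return d
```

For instance, `parse_dterm("D2D33")` yields the nested structure `('D', '2', ('D', '3', '3'))` corresponding to D(2, D(3, 3)).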

Figure 2d is like Fig. 2b, but with a different labeling: node labels now refer to the line in Fig. 2c that corresponds to the subproof rooted at the node. The blank node represents the mentioned subproof of the formula that is not made explicit in Fig. 2c. An inner node represents a condensed detachment step applied to the subproofs of the major premise (left child) and minor premise (right child).

Figure 2e shows a DAG (directed acyclic graph) representation of Fig. 2d. It is the unique maximally factored DAG representation of the tree, i.e., it has no multiple occurrences of the same subtree. Each of the four proof line labels of Fig. 2c appears exactly once in the DAG.

Fig. 3. Proof MER, Meredith's refinement [24] of Łukasiewicz's proof [19].

We conclude this introductory section by reproducing Meredith's refinement of Łukasiewicz's completeness proof in Fig. 3, taken from [24]. Since we will often refer to this proof, we call it MER. There is a single axiom (1), which is Łukasiewicz. The proven theorems are Syll (17), Peirce (18) and Simp (19). In addition to line numbers, the symbol n also appears in some of the proof terms. Its meaning will be explained later, in the context of Def. 19. For now, we can read n just as "1". Dots are used in the Polish notation to disambiguate numeric identifiers with more than a single digit.

## 3 Condensed Detachment and a Formal Basis

Following [4], the idea of condensed detachment can be described as follows: given premises F → G and H, we can conclude G′, where G′ is the most general result that can be obtained by using a substitution instance H′ of H as minor premise together with the corresponding substitution instance F′ → G′ of F → G as major premise in modus ponens. Condensed detachment was introduced by Meredith in the mid-1950s as an evolution of the earlier method of substitution and detachment, where the involved substitutions were explicitly given. The original presentations of condensed detachment are informal, by means of examples [28,17,29,25]; formal specifications were given later [16,13,4]. In ATP, the rendering of condensed detachment by hyperresolution with the clausal form of axiom Det is so far the prevalent view. As overviewed in [23,31], many of the early successes of ATP were based on condensed detachment. Starting from the hyperresolution view, structural aspects of condensed detachment have been considered by Robert Veroff [34] with the use of term representations of proofs and linked resolution. Results of ATP systems on deriving the Tarski–Bernays axioms from Łukasiewicz are reported in [27,39,22,23,11]. Our goal in this section is to provide a formal framework that makes the achievements of condensed detachment accessible from a modern ATP view, in particular the incorporation of unification, the interplay of nested structures with explicitly and implicitly associated formulas, the sharing of structures through lemmas, and the availability of proof structures as terms.
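As a concrete illustration, condensed detachment can be sketched with standard first-order unification. The term representation (nested tuples with head i, plain strings as variables) and all helper names below are our own illustrative choices, not notation from the paper:

```python
import itertools

_fresh = itertools.count(1)

def resolve(t, s):
    """Follow variable bindings in substitution s."""
    while isinstance(t, str) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    t = resolve(t, s)
    return t == v or (isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:]))

def unify(t1, t2, s):
    """Extend substitution s to unify t1 and t2; return None on failure."""
    t1, t2 = resolve(t1, s), resolve(t2, s)
    if t1 == t2:
        return s
    if isinstance(t1, str):
        return None if occurs(t1, t2, s) else {**s, t1: t2}
    if isinstance(t2, str):
        return unify(t2, t1, s)
    if t1[0] != t2[0] or len(t1) != len(t2):
        return None
    for a, b in zip(t1[1:], t2[1:]):
        s = unify(a, b, s)
        if s is None:
            return None
    return s

def apply_subst(t, s):
    t = resolve(t, s)
    if isinstance(t, str):
        return t
    return (t[0],) + tuple(apply_subst(a, s) for a in t[1:])

def variant(t, m):
    """Copy t with fresh variables (m maps old names to new ones)."""
    if isinstance(t, str):
        if t not in m:
            m[t] = f"v{next(_fresh)}"
        return m[t]
    return (t[0],) + tuple(variant(a, m) for a in t[1:])

def cd(major, minor):
    """Condensed detachment: from i(F, G) and H conclude the most general
    instance of G obtained by unifying F with a fresh copy of H."""
    major, minor = variant(major, {}), variant(minor, {})
    assert isinstance(major, tuple) and major[0] == 'i'
    s = unify(major[1], minor, {})
    return None if s is None else apply_subst(major[2], s)

def canonical(t, m=None):
    """Rename variables canonically, for comparison up to renaming."""
    m = {} if m is None else m
    if isinstance(t, str):
        return m.setdefault(t, f"c{len(m)}")
    return (t[0],) + tuple(canonical(a, m) for a in t[1:])

# The axiom of the Fig. 2 example, i(i(ipq, r), iqr):
A = ('i', ('i', ('i', 'p', 'q'), 'r'), ('i', 'q', 'r'))
# Condensed detachment of the axiom with itself yields i(r, i(q, r)), i.e. Simp.
simp = cd(A, A)
```

Applying `cd` to two copies of the Fig. 2 axiom thus reproduces, in one step, the Tarski–Bernays axiom Simp.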

Our notation follows common practice [6] (e.g., s ≥· t expresses that t subsumes s, and s ⊵ t that t is a subterm of s), with some additions [37]. For formulas F we write the universal closure as ∀F, and for terms s, t, u we use s[t → u] to denote s after simultaneously replacing all occurrences of t with u.
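The subsumption relation ≥· can be decided by one-way matching. A minimal sketch under the same tuple term representation as above; the function names are ours:

```python
def match(general, specific, binding):
    """One-way matching: extend binding so that instantiating general
    with it yields exactly specific; return None if impossible."""
    if isinstance(general, str):             # a variable of the general term
        if general in binding:
            return binding if binding[general] == specific else None
        return {**binding, general: specific}
    if not isinstance(specific, tuple) or general[0] != specific[0] \
            or len(general) != len(specific):
        return None
    for g, t in zip(general[1:], specific[1:]):
        binding = match(g, t, binding)
        if binding is None:
            return None
    return binding

def subsumes(t, s):
    """t subsumes s (written s >=. t): some instance of t equals s."""
    return match(t, s, {}) is not None
```

For example, the Simp pattern i(x, i(y, x)) subsumes its instance i(ipq, i(r, ipq)), but not vice versa.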

### 3.1 Proof Structures: D-Terms, Tree Size and Compacted Size

In this section we consider only the purely structural aspects of condensed detachment proofs. Emphasis is on a twofold view of the proof structure, as a tree and as a DAG (directed acyclic graph), which factorizes multiple occurrences of the same subtree. Both representation forms are useful: the compacted DAG form captures that lemmas can be repeatedly used in a proof, whereas the tree form facilitates specifying properties in an inductive manner. We call the terms that represent proof trees with the binary function symbol D *D-terms*.

Definition 1. (i) We assume a distinguished set of symbols called primitive D-terms. (ii) A D-term is inductively specified as follows: (1.) Any primitive D-term is a D-term. (2.) If d<sub>1</sub> and d<sub>2</sub> are D-terms, then D(d<sub>1</sub>, d<sub>2</sub>) is a D-term. (iii) The set of primitive D-terms occurring in a D-term d is denoted by Prim(d). (iv) The set of all D-terms that are not primitive is denoted by 𝒟.

A D-term is a full binary tree (i.e., a binary tree in which every node has either 0 or 2 children) whose leaves are labeled with symbols, i.e., primitive D-terms. An example D-term is

$$d \stackrel{\text{def}}{=} \mathsf{D}(\mathsf{D}(1,1), \mathsf{D}(\mathsf{D}(1,\mathsf{D}(1,1)), \mathsf{D}(1,\mathsf{D}(1,1)))),\tag{i}$$

which represents the structure of the proof shown in Fig. 2 and can be visualized by the full binary tree of Fig. 2d after removing all labels except the leaf labels. The proof annotations in Fig. 2c and Fig. 3 are D-terms written in Polish notation. The expression D2D33 in line 4 of Fig. 2, for example, stands for the D-term D(2, D(3, 3)), with Prim(D(2, D(3, 3))) = {2, 3}.

A finite tree and, more generally, a finite set of finite trees can be represented as a DAG, where each node in the DAG corresponds to a subtree of a tree in the given set. It is well known that there is a unique minimal such DAG, which is maximally factored (it has no multiple occurrences of the same subtree) or, equivalently, is minimal with respect to the number of nodes, and which, moreover, can be computed in linear time [7]. The number of nodes of the minimal DAG is the number of distinct subtrees of the members of the set of trees. There are two useful notions for measuring the size of a D-term, based directly on its tree representation and based on its minimal DAG, respectively.
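The minimal DAG can be built by hash-consing, which runs in (expected) linear time with hashing; [7] gives a worst-case linear-time algorithm. A sketch with trees as nested tuples (representation and names are ours):

```python
def minimal_dag(trees):
    """Map each distinct subtree of the given trees to a unique node id;
    the resulting table is the minimal (maximally factored) DAG."""
    node_id = {}

    def intern(t):
        if t not in node_id:
            if isinstance(t, tuple):   # visit children before the node itself
                intern(t[1])
                intern(t[2])
            node_id[t] = len(node_id)
        return node_id[t]

    for t in trees:
        intern(t)
    return node_id

# The D-term d of formula (i); its minimal DAG (cf. Fig. 2e) has 4 inner nodes.
d = ('D', ('D', '1', '1'),
     ('D', ('D', '1', ('D', '1', '1')), ('D', '1', ('D', '1', '1'))))
dag = minimal_dag([d])
```

Here `dag` has five entries in total: the leaf symbol 1 plus the four distinct compound subtrees.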

Definition 2. (i) The tree size of a D-term d, in symbols t-size(d), is the number of occurrences of the function symbol D in d. (ii) The compacted size of a D-term d is defined as c-size(d) def= |{e ∈ 𝒟 | d ⊵ e}|. (iii) The compacted size of a finite set D of D-terms is defined as c-size(D) def= |{e ∈ 𝒟 | d ⊵ e for some d ∈ D}|.

The tree size of a D-term can equivalently be characterized as the number of its inner nodes. The compacted size of a D-term is the number of its distinct compound subterms. It can equivalently be characterized as the number of inner nodes of its minimal DAG. As an example consider the D-term d defined in formula (i), whose minimal DAG is shown in Fig. 2e. The tree size of d is t-size(d) = 7 and the compacted size of d is c-size(d) = 4, corresponding to the cardinality of the set {e ∈ 𝒟 | d ⊵ e} of compound subterms of d, i.e., {D(1, 1), D(1, D(1, 1)), D(D(1, D(1, 1)), D(1, D(1, 1))), d}.

As will be explicated in more detail below, each occurrence of the function symbol D in a D-term corresponds to an instance of the meta-level axiom Det in the represented proof. Hence the tree size measures the number of instances of Det in the proof. Another view is that each occurrence of D in a D-term corresponds to a condensed detachment step, without re-using already proven lemmas. The compacted size of a D-term is the number of its distinct compound subterms, corresponding to the view that the size of the proof of a lemma is counted only once, even if the lemma is used multiple times. Tree size and compacted size of D-terms appear in [34] as CDcount and length, respectively.
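Under the tuple representation of D-terms, both size measures are directly computable; a minimal sketch (names ours):

```python
def t_size(d):
    """Tree size: number of occurrences of D, i.e. of inner nodes."""
    if not isinstance(d, tuple):
        return 0
    return 1 + t_size(d[1]) + t_size(d[2])

def compound_subterms(d):
    """Set of compound (non-primitive) subterms of d, including d itself."""
    if not isinstance(d, tuple):
        return set()
    return {d} | compound_subterms(d[1]) | compound_subterms(d[2])

def c_size(d):
    """Compacted size: number of distinct compound subterms."""
    return len(compound_subterms(d))

# The D-term d of formula (i): t-size(d) = 7 and c-size(d) = 4.
d = ('D', ('D', '1', '1'),
     ('D', ('D', '1', ('D', '1', '1')), ('D', '1', ('D', '1', '1'))))
```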

### 3.2 Proof Structures, Formula Substitutions and Semantics

We use a notion of unifier that applies to a set of pairs of terms, as convenient in discussions based on the CM [1,9,8].

Definition 3. Let M be a set of pairs of terms and let σ be a substitution. (i) σ is said to be a unifier of M if for all {s, t} ∈ M it holds that sσ = tσ. (ii) σ is called a most general unifier of M if σ is a unifier of M and for all unifiers σ′ of M it holds that σ′ ≥· σ. (iii) σ is called a clean most general unifier of M if it is a most general unifier of M and, in addition, is idempotent and satisfies Dom(σ) ∪ VRng(σ) ⊆ Var(M).

The additional properties required for clean most general unifiers do not hold for all most general unifiers.<sup>3</sup> However, the unification algorithms known from the literature produce clean most general unifiers [9, Remark 4.2]. If a set of pairs of terms has a unifier, then it has a most general unifier and, moreover, also a clean most general unifier.

Definition 4. (i) If M is a set of pairs of terms that has a unifier, then mgu(M) denotes some clean most general unifier of M; M is called unifiable and mgu(M) is called defined in this case, otherwise undefined. (ii) We make the convention that proposition, lemma and theorem statements implicitly assert their claims only for the case where the occurrences of mgu in them are defined.

<sup>3</sup> The inaccuracy observed by [13] in early formalizations of condensed detachment can be attributed to disregarding the requirement Dom(σ) ∪ VRng(σ) ⊆ Var(M).

Since we define mgu(M) as a clean most general unifier, we are permitted to assume that it is idempotent and that all variables occurring in its domain and range occur in M. Convention 4.ii serves to reduce clutter in proposition, lemma and theorem statements.

The structural aspects of condensed detachment proofs represented by D-terms, i.e., full binary trees, will now be supplemented with associated formulas. Condensed detachment proofs, similar to CM proofs, involve different instances of the input formulas (viewed as quantifier-free, e.g., clauses), which may be considered as obtained in two steps: first, "copies", that is, variants with fresh variables, of the input formulas are created; second, a substitution is applied to these copies. Let us consider the first step now. The framework of D-terms permits giving the variables in the copies canonical designators, with an index subscript that identifies the position in the structure, i.e., in the D-term, or tree.

Definition 5. For all positions p and positive integers i, let x<sup>i</sup><sub>p</sub> and y<sub>p</sub> denote pairwise distinct variables.

Recall that positions are path specifiers. For a given D-term d and leaf position p of d, the variables x<sup>i</sup><sub>p</sub> are for use in the formula associated with p, which is the copy of an axiom. Different variables in the copy are distinguished by the upper index i. If p is a non-leaf position of d, then y<sub>p</sub> denotes the variable in the conclusion of the copy of Det that is represented by p. In addition, y<sub>p</sub> for leaf positions p may occur in the antecedents of the copies of Det. The following substitution shift<sub>p</sub> is a tool to systematically rename position-associated variables while preserving the internal relationships between the index-referenced positions.

Definition 6. For all positions p define the substitution shift<sub>p</sub> as follows:

$$\mathit{shift}_p \stackrel{\text{def}}{=} \{y_q \mapsto y_{p.q} \mid q \text{ is a position}\} \cup \{x^i_q \mapsto x^i_{p.q} \mid i \geq 1 \text{ and } q \text{ is a position}\}.$$

The application of shift<sub>p</sub> to a term s has the effect that p is prepended to the position indexes of all position-associated variables occurring in s. The association of axioms with primitive D-terms is represented by mappings which we call axiom assignments, defined as follows.

Definition 7. An axiom assignment α is a mapping whose domain is a set of primitive D-terms and whose range is a set of terms whose variables are in {x<sup>i</sup><sub>ε</sub> | i ≥ 1}, where ε is the empty (top) position. We say that α is for a D-term d if Dom(α) ⊇ Prim(d).

We define a shorthand for a form of Łukasiewicz that is suitable for use as a range element of axiom assignments. It is parameterized with a position p.

$$\textit{Łukasiewicz}_p \stackrel{\text{def}}{=} i(i(i(x^1_p, x^2_p), x^3_p),\, i(i(x^3_p, x^1_p), i(x^4_p, x^1_p))).\tag{ii}$$

The mapping {1 ↦ Łukasiewicz<sub>ε</sub>} is an axiom assignment for all D-terms d with Prim(d) = {1}. The second step of obtaining the instances involved in a proof can be performed by applying the most general unifier of the pairs of terms that constrain it. The tree structure of D-terms permits associating exactly one such pair with each term position. Inner positions represent detachment steps, and leaf positions represent instances of an axiom according to a given axiom assignment. The following definition specifies these constraining pairs.

Definition 8. Let d be a D-term and let α be an axiom assignment for d. For all positions p ∈ Pos(d) define the pair of terms

$$\mathit{pairing}_\alpha(d, p) \stackrel{\text{def}}{=} \begin{cases} \{y_p,\ \alpha(d|_p)\mathit{shift}_p\} & \text{if } p \in \mathit{LeafPos}(d),\\ \{y_{p.1},\ i(y_{p.2}, y_p)\} & \text{if } p \in \mathit{InnerPos}(d). \end{cases}$$

A unifier of the set of pairings of all positions of a D-term d equates, for a leaf position p, the variable y<sub>p</sub> with the value of the axiom assignment α for the primitive D-term at p, after "shifting" variables by p. This "shifting" means that the position subscript of the variables in the axiom argument term α(d|<sub>p</sub>) is replaced by p, yielding a dedicated copy of the axiom argument term for the leaf position p. For inner positions p the unifier equates y<sub>p.1</sub> and i(y<sub>p.2</sub>, y<sub>p</sub>), reflecting that the major premise of Det is proven by the left child of p.

The substitution induced by the pairings associated with the positions of a D-term allows associating a specific formula with each position of the D-term, called the in-place theorem (IPT). The case where the position is the top position is distinguished as the most general theorem (MGT).

Definition 9. For D-terms d, positions p ∈ Pos(d) and axiom assignments α for d, define the in-place theorem (IPT) of d at p for α, Ipt<sub>α</sub>(d, p), and the most general theorem (MGT) of d for α, Mgt<sub>α</sub>(d), as (i) Ipt<sub>α</sub>(d, p) def= P(y<sub>p</sub> mgu({pairing<sub>α</sub>(d, q) | q ∈ Pos(d)})). (ii) Mgt<sub>α</sub>(d) def= Ipt<sub>α</sub>(d, ε).

Since Ipt and Mgt are defined on the basis of mgu, they are undefined if the set of pairs of terms underlying the respective application of mgu is not unifiable. Hence, we apply the convention of Def. 4.ii for mgu also to occurrences of Ipt and Mgt. If Ipt and Mgt are defined, they both denote an atom whose variables are constrained by the clean property of the underlying application of mgu. The following proposition relates IPT and MGT with respect to subsumption.

Proposition 10. For all D-terms d, positions p ∈ Pos(d) and axiom assignments α for d it holds that Ipt<sub>α</sub>(d, p) ≥· Mgt<sub>α</sub>(d|<sub>p</sub>).

By Prop. 10, the IPT at some position p of a D-term d is subsumed by the MGT of the subterm d|<sub>p</sub> of d rooted at position p. An intuitive argument is that the only constraints that determine the most general unifier underlying the MGT are induced by positions of d|<sub>p</sub>, that is, below p (including p itself). In contrast, the most general unifier underlying the IPT is determined by all positions of d.

The following lemma expresses the core relationships between a proof structure (a D-term), a proof substitution (accessed via the IPT) and semantic entailment of associated formulas.

Lemma 11. Let d be a D-term and let α be an axiom assignment for d. Then for all p ∈ Pos(d) it holds that: (i) If p ∈ LeafPos(d), then ∀P(α(d|<sub>p</sub>)) |= Ipt<sub>α</sub>(d, p). (ii) If p ∈ InnerPos(d), then Det ∧ Ipt<sub>α</sub>(d, p.1) ∧ Ipt<sub>α</sub>(d, p.2) |= Ipt<sub>α</sub>(d, p).

Based on this lemma, the following theorem shows how Detachment together with the axioms in an axiom assignment entail the MGT of a given D-term.

Theorem 12. Let d be a D-term and let α be an axiom assignment for d. Then

$$\textit{Det} \;\land \bigwedge_{p \in \mathit{LeafPos}(d)} \forall \mathrm{P}(\alpha(d|_p)) \;\models\; \forall\, \mathit{Mgt}_\alpha(d).$$

Theorem 12 states that Det together with the axioms referenced in the proof, that is, the values of α for the leaf nodes of d considered as universally closed atoms, entails the universal closure of the MGT of d for α. The universal closure of the MGT is the formula exhibited in Meredith's proof notation in the lines with a trailing D-term, such as lines 2–19 in Fig. 3.
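The MGT of a D-term can also be computed bottom-up: the MGT of a primitive D-term is a fresh copy of its axiom, and the MGT of D(d<sub>1</sub>, d<sub>2</sub>) results from one condensed detachment step on the MGTs of d<sub>1</sub> and d<sub>2</sub>; up to variable renaming this agrees with the global unification of Def. 9. A self-contained sketch (tuple representation, helper names and the stepwise strategy are our own; as a sanity check it reproduces the Fig. 2 example, where the D-term of formula (i) over the axiom i(i(ipq, r), iqr) proves ∀pqrstu Pi(p, i(q, i(r, i(s, i(t, ius)))))):

```python
import itertools

_fresh = itertools.count(1)

def resolve(t, s):
    while isinstance(t, str) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    t = resolve(t, s)
    return t == v or (isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:]))

def unify(t1, t2, s):
    t1, t2 = resolve(t1, s), resolve(t2, s)
    if t1 == t2:
        return s
    if isinstance(t1, str):
        return None if occurs(t1, t2, s) else {**s, t1: t2}
    if isinstance(t2, str):
        return unify(t2, t1, s)
    if t1[0] != t2[0] or len(t1) != len(t2):
        return None
    for a, b in zip(t1[1:], t2[1:]):
        s = unify(a, b, s)
        if s is None:
            return None
    return s

def apply_subst(t, s):
    t = resolve(t, s)
    if isinstance(t, str):
        return t
    return (t[0],) + tuple(apply_subst(a, s) for a in t[1:])

def variant(t, m):
    if isinstance(t, str):
        if t not in m:
            m[t] = f"v{next(_fresh)}"
        return m[t]
    return (t[0],) + tuple(variant(a, m) for a in t[1:])

def mgt(d, axiom):
    """Most general theorem of D-term d over a single axiom (None if undefined)."""
    if not isinstance(d, tuple):
        return variant(axiom, {})           # fresh copy of the axiom
    major, minor = mgt(d[1], axiom), mgt(d[2], axiom)
    if major is None or minor is None or not isinstance(major, tuple):
        return None
    z = f"v{next(_fresh)}"
    s = unify(major, ('i', minor, z), {})   # the major premise must be i(minor', z)
    return None if s is None else apply_subst(z, s)

def canonical(t, m=None):
    m = {} if m is None else m
    if isinstance(t, str):
        return m.setdefault(t, f"c{len(m)}")
    return (t[0],) + tuple(canonical(a, m) for a in t[1:])

A = ('i', ('i', ('i', 'p', 'q'), 'r'), ('i', 'q', 'r'))   # the Fig. 2 axiom
d = ('D', ('D', '1', '1'),
     ('D', ('D', '1', ('D', '1', '1')), ('D', '1', ('D', '1', '1'))))
theorem = mgt(d, A)
```

Here `theorem` is, up to renaming, i(p, i(q, i(r, i(s, i(t, i(u, s)))))), the theorem of the Fig. 2 example.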

## 4 Reducing the Proof Size by Replacing Subproofs

The term view of proof trees suggests shortening proofs by rewriting subterms, that is, replacing occurrences of subproofs with other ones, with three main aims: (1) to shorten given proofs, with respect to the tree size or the compacted size; (2) to investigate whether given proofs can be shortened by certain rewritings or are closed under them; (3) to develop notions of redundancy for use in proof search, where a proof fragment constructed during search may be rejected if it can be rewritten to a shorter one.

It is obvious that if a D-term d′ is obtained from a D-term d by replacing an occurrence of a subterm e with a D-term e′ such that t-size(e) ≥ t-size(e′), then also t-size(d) ≥ t-size(d′). Based on the following ordering relations on D-terms, which we call compaction orderings, an analogous statement for reducing the compacted size instead of the tree size can be made.

Definition 13. For D-terms d, e define (i) d ≥<sub>c</sub> e def= {f ∈ 𝒟 | d ▷ f} ⊇ {f ∈ 𝒟 | e ▷ f}. (ii) d ><sub>c</sub> e def= d ≥<sub>c</sub> e and not e ≥<sub>c</sub> d.

The relations d ≥<sub>c</sub> e and d ><sub>c</sub> e compare D-terms d and e with respect to the superset relationship of their sets of those strict subterms that are compound terms. For example, D(D(D(1, 1), 1), 1) ><sub>c</sub> D(1, D(1, 1)) because {D(1, 1), D(D(1, 1), 1)} ⊋ {D(1, 1)}.
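The compaction orderings compare sets of strict compound subterms and can be sketched directly under the tuple representation (names ours):

```python
def compound_subterms(d):
    """All compound (non-primitive) subterms of d, including d itself."""
    if not isinstance(d, tuple):
        return set()
    return {d} | compound_subterms(d[1]) | compound_subterms(d[2])

def strict_compound_subterms(d):
    """Strict subterms of d that are compound: the set compared by >=_c."""
    if not isinstance(d, tuple):
        return set()
    return compound_subterms(d[1]) | compound_subterms(d[2])

def geq_c(d, e):
    """d >=_c e: superset relationship of strict compound subterms."""
    return strict_compound_subterms(d) >= strict_compound_subterms(e)

def gt_c(d, e):
    """d >_c e: d >=_c e but not e >=_c d."""
    return geq_c(d, e) and not geq_c(e, d)
```

On the example above, `gt_c` confirms D(D(D(1, 1), 1), 1) ><sub>c</sub> D(1, D(1, 1)).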

Theorem 14. Let d, d′, e, e′ be D-terms such that e occurs in d, and d′ = d[e → e′]. It holds that (i) If e′ ∈ 𝒟 and e ≥<sub>c</sub> e′, then c-size(d) ≥ c-size(d′). (ii) If e ><sub>c</sub> e′, then sc-size(d) > sc-size(d′), where, for all D-terms d, sc-size(d) def= Σ<sub>d ⊵ e</sub> c-size(e).

Theorem 14.i states that if d′ is the D-term obtained from d by simultaneously replacing all occurrences of a compound D-term e with a "c-smaller" D-term e′, i.e., e ≥<sub>c</sub> e′, then the compacted size of d′ is less than or equal to that of d. As stated in the supplementary Theorem 14.ii, the sc-size is a measure that strictly decreases under the strict precondition e ><sub>c</sub> e′, which is useful for ensuring termination of rewriting. The following proposition characterizes the number of D-terms that are smaller than a given D-term w.r.t. the compaction ordering ≥<sub>c</sub>.

Proposition 15. For all D-terms d it holds that |{e | d ≥<sub>c</sub> e and Prim(e) ⊆ Prim(d)}| = (c-size(d) − 1 + |Prim(d)|)<sup>2</sup> + |Prim(d)|.

By Prop. 15, for a given D-term d, the number of D-terms e that are smaller than d with respect to ≥<sub>c</sub> (and built from the primitive D-terms of d) is only quadratically larger than the compacted size of d, and thus also than the tree size of d. Hence techniques that inspect all these smaller D-terms for a given D-term can be used efficiently in practice.
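The count of Prop. 15 can be cross-checked by explicit enumeration: a D-term e with d ≥<sub>c</sub> e over Prim(d) is either a primitive of d or of the form D(a, b) with a, b drawn from Prim(d) together with the strict compound subterms of d. A sketch (tuple representation and names ours):

```python
from itertools import product

def compound_subterms(d):
    """All compound subterms of d, including d itself."""
    if not isinstance(d, tuple):
        return set()
    return {d} | compound_subterms(d[1]) | compound_subterms(d[2])

def primitives(d):
    """The set Prim(d) of primitive D-terms occurring in d."""
    if not isinstance(d, tuple):
        return {d}
    return primitives(d[1]) | primitives(d[2])

def c_smaller(d):
    """All D-terms e with d >=_c e and Prim(e) a subset of Prim(d)."""
    prims = primitives(d)
    strict = compound_subterms(d) - {d}
    parts = prims | strict
    return prims | {('D', a, b) for a, b in product(parts, repeat=2)}

def check_prop15(d):
    """Verify the counting formula of Prop. 15 by enumeration."""
    prims, c = primitives(d), len(compound_subterms(d))
    return len(c_smaller(d)) == (c - 1 + len(prims)) ** 2 + len(prims)

# The D-term d of formula (i): (4 - 1 + 1)^2 + 1 = 17 c-smaller D-terms.
d = ('D', ('D', '1', '1'),
     ('D', ('D', '1', ('D', '1', '1')), ('D', '1', ('D', '1', '1'))))
```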

According to Theorem 12, a condensed detachment proof, i.e., a D-term d with an axiom assignment α, proves the MGT of d for α along with the instances of the MGT. In general, replacing subterms of d should yield a proof of at least these theorems, that is, a proof whose MGT subsumes the original one. The following theorem expresses conditions which ensure that subterm replacements yield a proof with an MGT that subsumes the original one.

Theorem 16. Let d, e be D-terms, let α be an axiom assignment for d and for e, and let p<sub>1</sub>,...,p<sub>n</sub>, where n ≥ 0, be positions in Pos(d) such that for all i, j ∈ {1,...,n} with i ≠ j it holds that p<sub>i</sub> ≰ p<sub>j</sub>. If for all i ∈ {1,...,n} it holds that Ipt<sub>α</sub>(d, p<sub>i</sub>) ≥· Mgt<sub>α</sub>(e), then Mgt<sub>α</sub>(d) ≥· Mgt<sub>α</sub>(d[e]<sub>p<sub>1</sub></sub>[e]<sub>p<sub>2</sub></sub> ... [e]<sub>p<sub>n</sub></sub>).

Theorem 16 states that simultaneously replacing a number of occurrences of possibly different subterms in a D-term by the same subterm, whose MGT subsumes each of the IPTs of the original occurrences, results in an overall D-term whose MGT subsumes that of the original overall D-term. The following theorem is similar, but restricted to a single replaced occurrence and with a stronger precondition. It follows from Theorem 16 and Prop. 10.

Theorem 17. Let d, e be D-terms and let α be an axiom assignment for d and for e. For all positions p ∈ Pos(d) it then holds that if Mgt<sub>α</sub>(d|<sub>p</sub>) ≥· Mgt<sub>α</sub>(e), then Mgt<sub>α</sub>(d) ≥· Mgt<sub>α</sub>(d[e]<sub>p</sub>).

Simultaneous replacements of subterm occurrences are essential for reducing the compacted size of proofs according to Theorem 14. For replacements according to Theorem 17 they can be achieved by successive replacements of individual occurrences. In Theorem 16 simultaneous replacements are explicitly considered because the replacement of one occurrence according to this theorem can invalidate the preconditions for another occurrence. Theorem 17 can be useful in practice because its precondition Mgt<sub>α</sub>(d|<sub>p</sub>) ≥· Mgt<sub>α</sub>(e) can be evaluated on the basis of α, e and just the subterm d|<sub>p</sub> of d, whereas determining Ipt<sub>α</sub>(d, p) for Theorem 16 also requires consideration of the context of p in d. Based on Theorems 16 and 14 we define the following notions of reduction and regularity.

Definition 18. Let d be a D-term, let e be a subterm of d and let α be an axiom assignment for d. For D-terms e′ the D-term d[e → e′] is then obtained by C-reduction from d for α if e ><sub>c</sub> e′, Mgt<sub>α</sub>(e′) is defined, and for all positions p ∈ Pos(d) such that d|<sub>p</sub> = e it holds that Ipt<sub>α</sub>(d, p) ≥· Mgt<sub>α</sub>(e′). The D-term d is called C-reducible for α if and only if there exist a subterm e of d and a D-term e′ such that d[e → e′] is obtained by C-reduction from d for α. Otherwise, d is called C-regular.

If d′ is obtained from d by C-reduction, then by Theorems 16 and 14 it follows that Mgt<sub>α</sub>(d) ≥· Mgt<sub>α</sub>(d′), c-size(d) ≥ c-size(d′) and sc-size(d) > sc-size(d′). C-regularity differs from well-known concepts of regularity in clausal tableaux (see, e.g., [14]) in two respects: (1) In the comparison of two nodes on a branch (which is done by subsumption, as in tableaux with universal variables), for the upper node the more strongly instantiated IPT is taken and for the lower node the more weakly instantiated MGT. (2) C-regularity is not based on relating two nested subproofs, but on comparing all occurrences of a subproof with respect to all proofs that are smaller in the compaction ordering.

Proofs may involve applications of Det where the conclusion Py is actually independent of the minor premise Px. Any axiom can then serve as a trivial minor premise. Meredith expresses this with the symbol n as second argument of the respective D-term. Our function simp-n simplifies D-terms accordingly, replacing subterms with n on the basis of the preservation of the MGT.

Definition 19. If d is a D-term and α is an axiom assignment for d, then the n-simplification of d with respect to α is the D-term simp-n<sub>α</sub>(d), where simp-n is the following function: simp-n<sub>α</sub>(d) ≝ d, if d is a primitive D-term; simp-n<sub>α</sub>(D(d1, d2)) ≝ D(simp-n<sub>α′</sub>(d1), n) if Mgt<sup>α′</sup>(D(d1, n)) = Mgt<sup>α</sup>(D(d1, d2)), where α′ = α ∪ {n ↦ k} for a fresh constant k; simp-n<sub>α</sub>(D(d1, d2)) ≝ D(simp-n<sub>α</sub>(d1), simp-n<sub>α</sub>(d2)), otherwise.
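
Operationally, the recursion of Definition 19 can be sketched as follows, a minimal sketch in Python that encodes D-terms as nested tuples with strings as primitive D-terms. Evaluating the MGT condition requires unification machinery not shown here, so it is abstracted into a caller-supplied oracle `minor_irrelevant` (a hypothetical name introduced only for this sketch):

```python
# D-terms: a primitive D-term is a string; a detachment step is ("D", d1, d2).
# The MGT-preservation test of Definition 19 needs unification and is
# abstracted into the oracle minor_irrelevant(d1, d2), which should return
# True iff detaching d1 against a fresh axiom n yields the same MGT as
# detaching d1 against d2.

def simp_n(d, minor_irrelevant):
    """n-simplification of a D-term (sketch of Definition 19)."""
    if not isinstance(d, tuple):                  # primitive D-term
        return d
    _, d1, d2 = d
    if minor_irrelevant(d1, d2):                  # minor premise is irrelevant
        return ("D", simp_n(d1, minor_irrelevant), "n")
    return ("D", simp_n(d1, minor_irrelevant),
                 simp_n(d2, minor_irrelevant))
```

The oracle cleanly separates the structural recursion, which is trivial, from the unification-based test, which carries all the logical content.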

#### 5 Properties of Meredith's Refined Proof

Our framework renders condensed detachment as a restricted form of the CM. This view permits considering the expanded proof structures as binary trees, or D-terms. On this basis we obtain a natural characterization of proof properties in various categories, which seem to be key to reducing the search space in ATP. Table 1 shows such properties for each of the 34 structurally different subproofs of proof MER (Fig. 3). Column M gives the number of the subproof in Fig. 3. We use the following short identifiers for the observed properties:

Structural Properties of the D-Term. These properties refer to the respective subproof as a D-term or full binary tree. DT, DC, DH: Tree size, compacted size, height. DKL, DKR: "Successive height", that is, the maximal number of successive edges going to the left (right, resp.) on any path from the root to a leaf. DP: Is "prime", that is, DT and DC are equal. DS: Relationship between the subproofs of major and minor premise. Identity is expressed with =, the subterm and superterm relationships with ◁ and ▷, resp., and the compaction ordering relationship (if none of the other relationships holds) with <<sub>c</sub> and ><sub>c</sub>. In addition it is indicated if a subproof is an axiom or n. DD: "Direct sharings", that is, the number of incoming edges in the DAG representation of the overall proof of all theorems. DR: "Repeats", that is, the total number of occurrences in the set of expanded trees of all roots of the DAG.
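
To make the structural measures concrete, here is a minimal sketch (function names are ours, not from the paper) that computes DT, DH, DC and the successive heights DKL/DKR for a D-term encoded as nested tuples; it takes tree size to be the number of detachment steps (inner nodes), one common convention:

```python
# A D-term is a string (axiom leaf) or a nested pair ("D", d1, d2).

def dt(d):
    """DT: tree size, counted as the number of detachment steps."""
    return 0 if isinstance(d, str) else 1 + dt(d[1]) + dt(d[2])

def dh(d):
    """DH: height of the D-term."""
    return 0 if isinstance(d, str) else 1 + max(dh(d[1]), dh(d[2]))

def dc(d):
    """DC: compacted size, the number of distinct compound subterms
    (the inner nodes of the DAG representation)."""
    seen = set()
    def walk(t):
        if isinstance(t, tuple) and t not in seen:
            seen.add(t)
            walk(t[1]); walk(t[2])
    walk(d)
    return len(seen)

def chain(d, side):
    """Length of the run of edges on one side (1 = left, 2 = right)
    starting at the root of d."""
    return 0 if isinstance(d, str) else 1 + chain(d[side], side)

def dk(d, side):
    """DKL (side = 1) / DKR (side = 2): maximal number of successive
    left/right edges on any root-to-leaf path."""
    if isinstance(d, str):
        return 0
    return max(chain(d, side), dk(d[1], side), dk(d[2], side))
```

A D-term is prime (DP) exactly when `dt(d) == dc(d)`, i.e., when no compound subterm is shared.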

Properties of the MGT. These properties refer to the argument term of the MGT of the respective subproof. TT, TH: Tree size (defined as for D-terms) and height. TV: Number of different variables occurring in the term. TO: Is "organic" [21], that is, the argument term has no strict subterm s such that P(s) itself is a theorem. We call an atom weakly organic (indicated by a gray bullet) if it is not organic and the argument term is of the form i(p, t), where p is a variable that does not occur in the term t and P(t) is organic. For axiomatizations of fragments of propositional logic, organicity can be checked with a SAT solver.

Regularity. RC: The respective subproof as D-term is C-regular (see Def. 18).


Table 1. Properties of all subproofs of the proof MER [24] shown in Fig. 3.

Comparisons with all Proofs of the MGT. These properties relate to the set of all proofs (as D-terms) of the MGT of the respective subproof. MT, MC: Minimal tree size and minimal compacted size of a proof. These values can be hard to determine, such that in Table 1 they are often only narrowed down to an integer interval. To determine them, we used the proof MER, proofs obtained with techniques described in Sect. 6, and enumerations of all D-terms with defined MGT up to a given tree size or compacted size.
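
The enumeration of all D-terms up to a given tree size mentioned above can be sketched as follows. The sketch enumerates only tree shapes over a single axiom, omitting the paper's filter to D-terms with defined MGT (which requires unification); the number of shapes with n detachment steps is then the n-th Catalan number:

```python
from functools import lru_cache

# Enumerate all D-term shapes over the single axiom "1" with exactly n
# detachment steps (inner nodes).  The MGT-definedness filter used in the
# paper is omitted in this sketch.

@lru_cache(maxsize=None)
def dterms(n):
    if n == 0:
        return ("1",)
    result = []
    for k in range(n):                       # k steps in the major premise
        for d1 in dterms(k):
            for d2 in dterms(n - 1 - k):
                result.append(("D", d1, d2))
    return tuple(result)
```

Memoization makes repeated enumeration up to a size bound cheap, since all smaller sizes are reused.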

Properties of Occurrences of the IPTs. The respective subproof has DR occurrences in the set of expanded trees of the roots of the DAG, where each occurrence has an IPT. The following properties refer to the multiset of argument terms of the IPTs of these occurrences. IT<sup>U</sup>, IT<sup>M</sup>: Maximal tree size and rounded median of the tree size. IH<sup>U</sup>, IH<sup>M</sup>: Maximal height and rounded median of the height. In Table 1 these values are much larger than those of the corresponding columns for the MGT, i.e., TT and TH, illustrating Prop. 10.

## 6 First Experiments

First experiments based on the framework developed in the previous sections are centered around the generation of lemmas, where not just formulas but, in the form of D-terms, also proofs are taken into account. This leads in general to a preference for small proofs and to narrowing down the search space by restricted structuring principles to build proofs. The experiments indicate novel potential calculi which combine aspects of lemma-based generative, bottom-up methods such as hyperresolution and hypertableaux with structure-based approaches that are typically used in an analytic, goal-directed way, such as the CM. In addition, ways to generate lemmas as preprocessing for theorem proving are suggested, in particular to obtain short proofs. This resulted in a refinement of Łukasiewicz's proof [19], whose compacted size is smaller by one than that of Meredith's refinement [24] and by two than that of Łukasiewicz's original proof.


Table 2. Proof dimensions of various proofs of problem *ŁDS*.

Table 2 shows the compacted size DC, tree size DT and height DH of various proofs of ŁDS. Asterisks indicate that n-simplification was applied with reducing effect on the system's proof. Proof (1.) is the one by Łukasiewicz [19], translated into condensed detachment; proof (2.) is proof MER (Fig. 3) [24]. Rows (3.)–(5.) show results from Prover9, where in (5.) the value of max\_depth was limited to 7, motivated by column TH of Table 1. Proof (4.) illustrates the effect of n-simplification.<sup>4</sup> For proofs (6.)–(9.) additional axioms were supplied to Prover9 and to CMProver [5,35,36], a goal-directed system that can be described by the CM. Columns indicate the lemma computation method, the number of lemmas supplied to the prover and the time used for lemma computation. Method PrimeCore adds the MGTs of subproof 18 from Table 1 and of all its subproofs as lemmas. Subproof 18 is the largest subproof of proof MER that is prime, and it can be characterized on the basis of the axiom – almost uniquely – as a proof that is prime, whose MGT has no smaller prime proof and has the same number of different variables as the axiom, i.e., 4, and whose size, given as parameter, is 17. Method ProofSubproof is based on detachment steps with a D-term and a subterm of it as proofs of the premises, which, as column DS of Table 1 shows, suffices to justify all except two proof steps in MER. It proceeds in some analogy to the given-clause algorithm, on lists of D-terms: if d is the given D-term, then the inferred D-terms are all D-terms that have a defined MGT and are of the form D(d, e) or D(e, d), where e is a subterm of d. To determine which of the inferred D-terms are kept, values from Table 1 were taken as a guide, including RC and TO. The first parameter of ProofSubproof is the number of iterations of the "given D-term loop"; the maximal value of DK<sub>L</sub> is shown as second parameter because, when DK<sub>L</sub> is limited to 7, proof (9.) cannot be found. Proof (9.) can be combined with Peirce and Syll to an overall proof with compacted size 32, one less than MER. Proof (10.), which has a small tree size, was obtained from (8.) by rewriting subproofs with a variation of C-reduction that rewrites single term occurrences, considering also D-terms from a precomputed table of small proofs.

<sup>4</sup> All machine results refer to a system with an Intel i7-8550U CPU and 16 GB RAM. Results for further systems: *KRHyper*<sup>∗</sup> [26]: 1.610 s, DC: 73; *E 2.5* [30]: 30 s, proof length 91; *Vampire 5.4.1* [33] –mode casc -t 300: 128 s, proof length 144.

## 7 Conclusion

Starting out from investigating Łukasiewicz's classic formal proof [19], via its refinement by Meredith [24], we arrived at a formal reconstruction of Meredith's condensed detachment as a special case of the CM. The resulting formalism yields proofs as objects of a very simple and common structure: full binary trees which, in the tradition of term rewriting, appear as terms, D-terms, as we call them. To form a full proof, formulas are associated with the nodes of D-terms: axioms with the leaves and lemmas with the remaining nodes, implicitly determined from the axioms through the node position and unification. The root lemma is the most general proven theorem. Lemmas also relate to compressed representations of the binary trees, for example as DAGs, where the re-use of a lemma directly corresponds to sharing the structure of its subproof. For future work we intend to position our approach in the context of earlier works on proofs, proof compression and lemma introduction, e.g., [38,12], and to compress D-terms in forms that are stronger than DAGs, e.g., by tree grammars [18].

The combination of formulas and explicitly available proof structures naturally leads to theorem proving methods that take structural aspects into account, in various ways, as demonstrated by our first experiments. This goes beyond the common clausal tableau realizations of the CM, which in essence operate by enumerating uncompressed proof structures. The discussed notions of regularity and lemma generation methods seem immediately suited for further investigations in the context of first-order theorem proving in general. For other aspects of the work we plan a stepwise generalization by considering further single axioms for the implicational fragment IF [21,19,32], single axioms and axiom pairs for further logics [32], the roughly 200 condensed detachment problems in the LCL domain of the TPTP, and problems which involve multiple non-unit clauses, as well as adapting D-terms to a variation of binary resolution instead of detachment. In the longer run, our approach aims at providing a basis for theorem proving with machine learning (e.g. [10,15]). With the reification of proof structures, more information is available as a starting point. As indicated by our exemplary feature table for Meredith's proof, structural properties are thereby considered from a global point of view, as a source for narrowing down the search space in many different ways, in contrast to just the common local view "from within a structure", where the narrowing down is achieved, for example, by focusing on a "current branch" during the construction of a tableau. A general lead question opened up by our setting is to explore relationships between properties of proof structures and the associated formulas in proofs of meaningful theorems. One may expect that characterizations of these relationships can substantially restrict the search space for finding proofs.

Acknowledgments. We appreciate the competent comments of all the referees.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

**Efficient Local Reductions to Basic Modal Logic**<sup>⋆</sup>

> Fabio Papacchini<sup>1</sup>, Cláudia Nalon<sup>2</sup>, Ullrich Hustadt<sup>1</sup>, and Clare Dixon<sup>3</sup>

<sup>1</sup> Department of Computer Science, University of Liverpool, UK, {Fabio.Papacchini,U.Hustadt}@liverpool.ac.uk <sup>2</sup> Department of Computer Science, University of Brasília, nalon@unb.br <sup>3</sup> Department of Computer Science, University of Manchester, clare.dixon@manchester.ac.uk

**Abstract.** We present novel reductions of the propositional modal logics KB, KD, KT, K4 and K5 to Separated Normal Form with Sets of Modal Levels. The reductions result in smaller formulae than the well-known reductions by Kracht and allow us to use the local reasoning of the prover KSP to determine the satisfiability of modal formulae in these logics. We show experimentally that the combination of our reductions with the prover KSP performs well when compared with a specialised resolution calculus for these logics and with the built-in reductions of the first-order prover SPASS.

#### **1 Introduction**

The main motivation for reducing problems in one logic (the source logic) to 'equivalent' problems in another logic (the target logic) is to exploit results and tools for the target logic to solve theoretical or practical problems in the source logic. For propositional modal logics this approach has been researched extensively for reductions of the satisfiability problem in these logics to the satisfiability problem in 'stronger' logics such as first-order logic [10,20], the second-order theory of n successors [6], simple type theory [4], and regular grammar logics [19].

An alternative approach is to reduce propositional modal logics to a 'weaker' logic, in particular, the basic modal logic K. For extensions of K with one of the axioms B, D, alt1, T, and 4, Kracht [12] defines reduction functions of their global and local satisfiability problem to the corresponding problem in K and proves their correctness. He also defines a reduction function for K5, the extension of K with 5, to K4, but this reduction is incorrect as not all theorems of K4 are theorems of K5. Several features of Kracht's approach are relevant to our work. First, as is not uncommon in modal logic, he treats the modal operator ✸ as abbreviation for ¬✷¬, that is, ✷ is the only modal operator occurring in modal formulae. Second, the basic idea underlying his reduction functions

<sup>⋆</sup> C. Dixon was partially supported by the EPSRC funded RAI Hubs FAIR-SPACE (EP/R026092/1) and RAIN (EP/R026084/1), and the EPSRC funded programme grant S4 (EP/N007565/1).

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 76–92, 2021. https://doi.org/10.1007/978-3-030-79876-5_5

is for a given modal formula ϕ to generate sufficiently many instances Δ of a modal axiom Λ so that ϕ is KΛ-satisfiable iff ϕ ∧ Δ is K-satisfiable. Third, Kracht is only concerned with preservation of the computational complexity of the satisfiability problem under consideration, as well as the preservation of other theoretical properties. For instance, the local satisfiability problem in the modal logics covered by Kracht is PSPACE-complete. So, it is sufficient to ensure that Δ is polynomial in size with respect to ϕ. As Kracht himself concludes, his method offers a uniform way of transferring results about one modal logic to another, but may not be as useful for practical applications.

In [16,15] we have introduced a new normal form for basic multi-modal logic, called Separated Normal Form with Modal Levels, SNFml, that uses labelled modal clauses. These labels refer to the level within a tree Kripke structure at which a modal clause holds. This can be seen as a compromise between approaches that label formulae with worlds at unspecified level [1,3] and approaches that label formulae with paths [5,23]. A combination of a normal form transformation for modal formulae and a resolution-based calculus for labelled modal clauses can then be used to decide local and global satisfiability in basic modal logic. In [17,18] we have presented KSP, an implementation of that calculus, together with an experimental evaluation that indicates that KSP performs well if propositional variables are evenly spread across a wide range of modal levels within the formulae one wants to decide.

A feature of SNFml is its use of additional propositional symbols as 'surrogates' for subformulae of a modal formula ϕ. In the following we take advantage of the availability of those surrogates to provide a novel transformation from extensions of K with a single one of the axioms B, D, T, 4 and 5 to SNFml. Another novel aspect is that we modify the normal form so that it uses sets of modal levels as labels instead of a single modal level. In K we only need a definition of a surrogate at the modal level at which the corresponding subformula occurs in ϕ. But in KB, KT, K4 and K5, we need a definition at every reachable modal level, of which there can be many. We call the resulting normal form, *Separated Normal Form with Sets of Modal Levels*, SNFsml.

The structure of the paper is as follows. In Section 2 we recap common concepts of propositional modal logic including its syntax and semantics. Section 3 defines SNFsml and the reductions of K, KB, KD, KT, K4 and K5 to SNFsml. Correctness is proved in Section 4. Related work is discussed in Section 5. In Section 6 we compare the performance of a combination of our reductions and the modal-layered resolution calculus implemented in prover KSP with resolution calculi specifically designed for the logics under consideration and with translation-based approaches built into the first-order theorem prover SPASS.

#### **2 Preliminaries**

The language of modal logic is an extension of the language of propositional logic with a unary modal operator ✷ and its dual ✸. More precisely, given a denumerable set of *propositional symbols*, P = {p, p0, q, q0, t, t0,...} as well as propositional *constants* **true** and **false**, *modal formulae* are inductively defined as follows: Constants and propositional symbols are modal formulae. If ϕ and ψ are modal formulae, then so are ¬ϕ, (ϕ ∧ ψ), (ϕ ∨ ψ), (ϕ → ψ), ✷ϕ, and ✸ϕ. We also assume that ∧ and ∨ are associative and commutative operators and consider, e.g., (p∨(q∨r)) and (r∨(q∨p)) to be identical formulae. We often omit parentheses if this does not cause confusion. By var(ϕ) we denote the set of all propositional symbols occurring in ϕ. This function straightforwardly extends to finite sets of modal formulae. A *modal axiom (schema)* is a modal formula ψ representing the set of all instances of ψ.

A *literal* is either a propositional symbol or its negation; the set of literals is denoted by L. We denote by ¬l the *complement* of the literal l ∈ L, that is, ¬l denotes ¬p if l is the propositional symbol p, and ¬l denotes p if l is the literal ¬p. A *modal literal* is either ✷l or ✸l, where l ∈ L.

A *(normal) modal logic* is a set of modal formulae which includes all propositional tautologies and the axiom schema ✷(ϕ → ψ) → (✷ϕ → ✷ψ), called the *axiom* K, and is closed under modus ponens (if ϕ and ϕ → ψ then ψ) and the rule of necessitation (if ϕ then ✷ϕ).

K is the weakest modal logic, that is, the logic given by the smallest set of modal formulae constituting a normal modal logic. By KΣ we denote the *extension* of K by a set Σ of axioms.

The standard semantics of modal logics is the *Kripke semantics* or *possible world semantics*. A *Kripke frame* F is an ordered pair ⟨W, R⟩ where W is a non-empty set of *worlds* and R is a binary (accessibility) relation over W. A *Kripke structure* M over P is an ordered pair ⟨F, V⟩ where F is a Kripke frame and the *valuation* V is a function mapping each propositional symbol in P to a subset V(p) of W. We say M = ⟨F, V⟩ is *based on the frame* F. A *rooted Kripke structure* is an ordered pair ⟨M, w0⟩ with w0 ∈ W. To simplify notation, in the following we write ⟨W, R, V⟩ and ⟨W, R, V, w0⟩ instead of ⟨⟨W, R⟩, V⟩ and ⟨⟨⟨W, R⟩, V⟩, w0⟩, respectively.

Satisfaction (or truth) of a formula at a world w of a Kripke structure M = ⟨W, R, V⟩ is inductively defined by:

M, w |= **true**; M, w ⊭ **false**; M, w |= p iff w ∈ V(p), where p ∈ P; M, w |= ¬ϕ iff M, w ⊭ ϕ; M, w |= (ϕ ∧ ψ) iff M, w |= ϕ and M, w |= ψ; M, w |= (ϕ ∨ ψ) iff M, w |= ϕ or M, w |= ψ; M, w |= (ϕ → ψ) iff M, w ⊭ ϕ or M, w |= ψ; M, w |= ✷ϕ iff for every v, wRv implies M, v |= ϕ; M, w |= ✸ϕ iff there is v such that wRv and M, v |= ϕ.

If M,w |= ϕ holds then M is a *model* of ϕ, ϕ is *true at* w *in* M and M *satisfies* ϕ. A modal formula ϕ is *satisfiable* iff there exists a Kripke structure M and a world w in M such that M,w |= ϕ. A modal formula ϕ is *globally true* or *valid* in a Kripke structure M if it is true at all worlds of M; it is *valid* if it is valid in all Kripke structures.
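
The satisfaction clauses above translate directly into a small model checker for finite Kripke structures; a minimal sketch with formulae encoded as nested tuples (the encoding is ours, chosen only for this sketch):

```python
# Formulae as nested tuples: "p", ("not", f), ("and", f, g), ("or", f, g),
# ("imp", f, g), ("box", f), ("dia", f); "true"/"false" are constants.
# A model is (W, R, V) with R a set of world pairs and V a dict mapping
# propositional symbols to the set of worlds where they hold.

def sat(model, w, f):
    """Satisfaction M, w |= f for a finite Kripke structure."""
    W, R, V = model
    if isinstance(f, str):
        return f == "true" or (f != "false" and w in V.get(f, set()))
    op = f[0]
    if op == "not":
        return not sat(model, w, f[1])
    if op == "and":
        return sat(model, w, f[1]) and sat(model, w, f[2])
    if op == "or":
        return sat(model, w, f[1]) or sat(model, w, f[2])
    if op == "imp":
        return (not sat(model, w, f[1])) or sat(model, w, f[2])
    if op == "box":                      # true at all R-successors of w
        return all(sat(model, v, f[1]) for (u, v) in R if u == w)
    if op == "dia":                      # true at some R-successor of w
        return any(sat(model, v, f[1]) for (u, v) in R if u == w)
    raise ValueError("unknown operator: %r" % (op,))
```

Note that ✷ϕ holds vacuously at a world without successors, matching the universal quantification in the ✷ clause.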


**Table 1.** Modal axioms and relational frame properties

In the following we are interested in extensions of K with the axiom schemata shown in Table 1. Each of these axiom schemata defines a class of Kripke frames where the accessibility relation R satisfies the first-order property stated in the table. Given a normal modal logic L with corresponding class of frames F, we say a modal formula ϕ is L*-satisfiable* iff there exists a frame F ∈ F, a valuation V and a world w0 of F such that ⟨F, V⟩, w0 |= ϕ.

A *path rooted at* w *of length* k, k ≥ 0, in a frame F = ⟨W, R⟩ is a sequence w = (w0, w1, ..., wk) where for every i, 1 ≤ i ≤ k, w<sub>i−1</sub> R w<sub>i</sub>. We say that the path (w0, w1, ..., wk) *connects* w0 *and* wk. For a path w = (w0, ..., wk) and a world wk+1 with wk R wk+1, w ◦ wk+1 denotes the path (w0, ..., wk, wk+1). A path (w0) of length 0 is identified with its root w0. We denote the set of all paths rooted at a world w0 in F by F[w0] and the set of all paths by F. The function trm : F → W maps every path w = (w0, ..., wk) to its terminal world wk, while the function len : F → N maps every path w = (w0, w1, ..., wk) to its length k.

A rooted Kripke structure M = ⟨W, R, V, w0⟩ is a *rooted tree Kripke structure* iff R is a tree, that is, a directed acyclic connected graph where each node has at most one predecessor, with *root* w0. It is a *rooted tree Kripke model* of a modal formula ϕ iff ⟨W, R, V, w0⟩ |= ϕ. In a rooted tree Kripke structure with root w0, for every world wk ∈ W there is exactly one path w connecting w0 and wk; the *modal level of* wk *(in* M*), denoted by* mlM(wk), is given by len(w).

Let F = ⟨W, R⟩ be a Kripke frame with w ∈ W. The *unravelling* F<sup>u</sup>[w] *of* F *at* w is the frame ⟨W′, R′⟩ where:

**–** W′ = F[w] is the set of all paths rooted at w in F;

**–** for all paths u, v ∈ W′, if v = u ◦ w for some w ∈ W, then u R′ v.
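
For a finite frame the unravelling can be computed directly from this definition; a sketch that represents paths as tuples and, since frames with cycles have infinitely many paths, cuts off at a given maximal path length (the `max_len` parameter is an addition of this sketch, not part of the definition above):

```python
# Unravelling of a finite frame (W, R) at world w: the new worlds are the
# paths rooted at w, and a path u is related to v iff v extends u by one
# R-step.  Paths are tuples of worlds.

def unravel(W, R, w, max_len):
    paths = [(w,)]
    frontier = [(w,)]
    R_new = set()
    while frontier:
        u = frontier.pop()
        if len(u) - 1 >= max_len:        # cut-off for cyclic frames
            continue
        for (a, b) in R:
            if a == u[-1]:               # extend u by one R-step
                v = u + (b,)
                R_new.add((u, v))
                paths.append(v)
                frontier.append(v)
    return paths, R_new

trm = lambda path: path[-1]              # terminal world of a path
ml  = lambda path: len(path) - 1         # modal level = path length
```

The helper functions `trm` and `ml` mirror the functions trm and len of the text: in the unravelled structure, the modal level of a world is just the length of the path it represents.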

Let F = ⟨W, R⟩ and F′ = ⟨W′, R′⟩ be two Kripke frames. A function f : W → W′ is a *p-morphism* (or a *bounded morphism*) from F to F′ if the following holds: **–** if vRw, then f(v) R′ f(w).

**–** if f(u) R′ w′, then there exists v ∈ W s.t. f(v) = w′ and uRv.

Analogously for Kripke models. For F = ⟨W, R⟩, M = ⟨F, V, w0⟩, and M′ = ⟨F<sup>u</sup>[w0], V′, (w0)⟩, the function trm is a p-morphism from M′ to M.

When considering local satisfiability, the following holds (see, e.g., [8]):

**Theorem 1.** *Let* ϕ *be a modal formula. Then* ϕ *is* K*-satisfiable iff there is a finite rooted tree Kripke structure* M = F, V, w0 *such that* M,w0 |= ϕ*.*

```
ϕ ∧ ϕ ⇒ ϕ          ϕ ∧ ¬ϕ ⇒ false      ✷true ⇒ true       ¬true ⇒ false      ¬¬ϕ ⇒ ϕ
ϕ ∨ ϕ ⇒ ϕ          ϕ ∨ ¬ϕ ⇒ true       ✸false ⇒ false     ¬false ⇒ true
ϕ ∧ true ⇒ ϕ       ϕ ∧ false ⇒ false   ϕ ∨ false ⇒ ϕ      ϕ ∨ true ⇒ true
```
**Table 2.** Rewriting Rules for Simplification

For the normal form transformation presented in the next section we assume that any modal formula ϕ has been simplified by exhaustively applying the rewrite rules in Table 2 and is in Negation Normal Form (NNF), that is, a formula where only propositional symbols are allowed in the scope of negations. We say that such a formula is in *simplified NNF*.
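
The rules of Table 2 can be applied exhaustively in a single bottom-up pass, since each rule only shrinks the formula and no rewrite at a node enables a rewrite below it; a sketch using a tuple encoding of formulae (the encoding is ours):

```python
# Bottom-up application of the rewrite rules of Table 2.  Formulae are
# "true"/"false", propositional symbols, or tuples ("not", f),
# ("and", f, g), ("or", f, g), ("box", f), ("dia", f).

def simplify(f):
    if isinstance(f, str):
        return f
    op, *args = f
    args = [simplify(a) for a in args]       # simplify subformulae first
    if op == "not":
        (a,) = args
        if a == "true":  return "false"      # ¬true ⇒ false
        if a == "false": return "true"       # ¬false ⇒ true
        if isinstance(a, tuple) and a[0] == "not":
            return a[1]                      # ¬¬ϕ ⇒ ϕ
        return ("not", a)
    if op in ("and", "or"):
        a, b = args
        unit = "true" if op == "and" else "false"
        zero = "false" if op == "and" else "true"
        if a == unit: return b               # ϕ ∧ true ⇒ ϕ / ϕ ∨ false ⇒ ϕ
        if b == unit: return a
        if a == zero or b == zero:
            return zero                      # ϕ ∧ false ⇒ false / ϕ ∨ true ⇒ true
        if a == b: return a                  # ϕ ∧ ϕ ⇒ ϕ / ϕ ∨ ϕ ⇒ ϕ
        if a == ("not", b) or b == ("not", a):
            return zero                      # ϕ ∧ ¬ϕ ⇒ false / ϕ ∨ ¬ϕ ⇒ true
        return (op, a, b)
    if op == "box" and args[0] == "true":
        return "true"                        # ✷true ⇒ true
    if op == "dia" and args[0] == "false":
        return "false"                       # ✸false ⇒ false
    return (op, args[0])
```

A subsequent NNF pass, pushing negations inward over ∧, ∨, ✷ and ✸ by the usual dualities, yields formulae in simplified NNF.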

## **3 Layered Normal Form with Sets of Levels**

A formula to be tested for satisfiability is first transformed into a normal form called *Separated Normal Form with Sets of Modal Levels*, SNFsml, whose language extends that of modal logic with labels consisting of sets of modal levels. Informally, we write S : ϕ, where S is a set of natural numbers, to denote that a formula <sup>ϕ</sup> is true at modal levels ml <sup>∈</sup> <sup>S</sup>. We write : <sup>ϕ</sup> instead of <sup>N</sup> : <sup>ϕ</sup>.

We introduce some notation that will be used in the following. Let S<sup>+</sup> = {l + 1 ∈ N | l ∈ S}, S<sup>−</sup> = {l − 1 ∈ N | l ∈ S}, and S<sup>≥</sup> = {n | n ≥ min(S)}, where min(S) is the least element of S. Note that the restriction of the elements being in N implies that S<sup>−</sup> cannot contain negative numbers.

The labels in SNFsml work as a kind of *weak* universal operator, allowing us to talk about formulae that are satisfied at all worlds in a given set of modal levels. Formally, we restrict ourselves to rooted tree Kripke structures M = ⟨W, R, V, w0⟩ and if S is a set of modal levels, then by M[S] we denote the set of worlds that are at a modal level in S, that is, M[S] = {w ∈ W | mlM(w) ∈ S}. The satisfaction of labelled formulae in a rooted tree Kripke structure M is then defined as follows:

M |= S : ϕ iff for every world w ∈ M[S], we have M,w |= ϕ.

If M |= S : ϕ, then we say that S : ϕ holds in M. Note that if S = ∅, then M |= S : ϕ trivially holds. For a set Φ of labelled formulae, M |= Φ iff M |= S : ϕ for every S : ϕ in Φ, and we say Φ is K*-satisfiable*.

A labelled modal formula is then an SNFsml clause iff it is of one of the following forms:

**–** a *literal clause* S : l<sub>1</sub> ∨ ... ∨ l<sub>r</sub>;
**–** a *positive modal clause* S : l′ → ✷l;
**–** a *negative modal clause* S : l′ → ✸l;

where S ⊆ N and l, l′, l<sub>b</sub> are propositional literals with 1 ≤ b ≤ r, r ∈ N. Positive and negative modal clauses are together known as *modal clauses*. We regard a literal clause as a set of literals, that is, two clauses are the same if they contain the same set of literals.

We assume that the set P of propositional symbols is partitioned into two infinite sets Q and T such that for every modal formula ψ we have var(ψ) ⊂ Q and there exists a propositional symbol t<sup>ψ</sup> ∈ T uniquely associated with ψ.

Given a modal formula ϕ in simplified NNF and L ∈ {K,KB,KD,KT,K4,K5}, then we can obtain a set Φ<sup>L</sup> of clauses in SNFsml such that ϕ is L-satisfiable iff Φ<sup>L</sup> is K-satisfiable as Φ<sup>L</sup> = {{0} : tϕ} ∪ ρL({0} : t<sup>ϕ</sup> → ϕ), where ρ<sup>L</sup> is defined as follows:

$$\begin{aligned} \rho\_L(S:t \to \mathtt{true}) &= \emptyset \\ \rho\_L(S:t \to \mathtt{false}) &= \{S: \neg t\} \\ \rho\_L(S:t \to (\psi\_1 \land \psi\_2)) &= \{S: \neg t \lor \eta(\psi\_1),\ S: \neg t \lor \eta(\psi\_2)\} \cup \delta\_L(S, \psi\_1) \cup \delta\_L(S, \psi\_2) \\ \rho\_L(S:t \to \psi) &= \{S: \neg t \lor \psi\} \quad \text{if } \psi \text{ is a disjunction of literals} \\ \rho\_L(S:t \to (\psi\_1 \lor \psi\_2)) &= \{S: \neg t \lor \eta(\psi\_1) \lor \eta(\psi\_2)\} \cup \delta\_L(S, \psi\_1) \cup \delta\_L(S, \psi\_2) \\ &\qquad \text{if } \psi\_1 \lor \psi\_2 \text{ is not a disjunction of literals} \\ \rho\_L(S:t \to \Diamond\psi) &= \{S: t \to \Diamond\eta(\psi)\} \cup \delta\_L(S^+, \psi) \\ \rho\_L(S:t \to \Box\psi) &= P\_L(S:t \to \Box\psi) \cup \Delta\_L(S:t \to \Box\psi) \end{aligned}$$
where η and δ<sup>L</sup> are defined as follows:

$$\eta(\psi) = \begin{cases} \psi, & \text{if } \psi \text{ is a literal} \\ t\_{\psi}, & \text{otherwise} \end{cases} \qquad \delta\_L(S, \psi) = \begin{cases} \emptyset, & \text{if } \psi \text{ is a literal} \\ \rho\_L(S: t\_{\psi} \to \psi), & \text{otherwise} \end{cases}$$

and functions PL, Δ<sup>L</sup> are defined as shown in Table 3. The function η maps a propositional literal ψ to itself while it maps every other modal formula ψ to a new propositional symbol t<sup>ψ</sup> ∈ T uniquely associated with ψ. We call t<sup>ψ</sup> the *surrogate* of ψ or simply a surrogate. The functions PKB and PK5 introduce additional propositional symbols, called *supplementary propositional symbols*, t✷¬t✷<sup>ψ</sup> ∈ T and t✸t✷<sup>ψ</sup> ∈ T, respectively, that do not correspond to subformulae of the formula we are transforming.
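
For the basic logic K, the definitions of ρ, η and δ above can be implemented directly. The sketch below renders clauses as strings for readability; its ✷ case simply mirrors the ✸ case with the label shifted to S<sup>+</sup>, which is an assumption of this sketch, since Table 3 with the definitions of P<sub>L</sub> and Δ<sub>L</sub> is not reproduced above:

```python
# rho for the basic logic K, on formulae in simplified NNF.  Labels S are
# frozensets of modal levels; literals are "p" or ("not", "p"); compound
# formulae are ("and", f, g), ("or", f, g), ("box", f), ("dia", f).
# Clauses are rendered as strings, e.g. "[0] : ~t | p".

def is_literal(f):
    return isinstance(f, str) or (f[0] == "not" and isinstance(f[1], str))

def show(f):
    if isinstance(f, str):
        return f
    if f[0] == "not":
        return "~" + f[1]
    if f[0] in ("box", "dia"):
        return ("[]" if f[0] == "box" else "<>") + show(f[1])
    sep = " & " if f[0] == "and" else " | "
    return "(" + sep.join(show(a) for a in f[1:]) + ")"

def eta(f):                       # surrogate introduction
    return f if is_literal(f) else "t_" + show(f)

def delta(S, f):                  # definitional clauses for a surrogate
    return set() if is_literal(f) else rho(S, "t_" + show(f), f)

def rho(S, t, f):
    lab = str(sorted(S)) + " : "
    if f == "true":
        return set()
    if f == "false":
        return {lab + "~" + t}
    if is_literal(f):
        return {lab + "~%s | %s" % (t, show(f))}
    if f[0] == "and":
        _, a, b = f
        return ({lab + "~%s | %s" % (t, show(eta(a))),
                 lab + "~%s | %s" % (t, show(eta(b)))}
                | delta(S, a) | delta(S, b))
    if f[0] == "or":
        _, a, b = f
        if is_literal(a) and is_literal(b):   # disjunction of literals
            return {lab + "~%s | %s | %s" % (t, show(a), show(b))}
        return ({lab + "~%s | %s | %s" % (t, show(eta(a)), show(eta(b)))}
                | delta(S, a) | delta(S, b))
    mod = "[]" if f[0] == "box" else "<>"     # modal cases, label shifts to S+
    S1 = frozenset(l + 1 for l in S)
    return {lab + "%s -> %s%s" % (t, mod, show(eta(f[1])))} | delta(S1, f[1])
```

For example, `rho(frozenset({0}), "t0", ("dia", ("and", "p", "q")))` produces a negative modal clause at level 0 and two definitional clauses for the surrogate of p ∧ q at level 1, illustrating how the label follows the modal depth of the subformula.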

Intuitively, PKB is based on the following consideration: Take a world w in a Kripke structure M with a symmetric accessibility relation R. If there exists a world v with wRv such that M,v |= ✷ψ, then M,w |= ψ. Now, take the contrapositive of that statement: If M,w |= ψ, then for every world v with wRv, M,v |= ✷ψ. Equivalently, M,w |= ψ or M,w |= ✷¬✷ψ. This is expressed by the formula η(ψ) ∨ t✷¬t✷<sup>ψ</sup> . For PK5 , the formula t✸t✷<sup>ψ</sup> → ✷t✸t✷<sup>ψ</sup> expresses an instance of axiom schema 5, ✸ϕ → ✷✸ϕ, with ϕ = ✷ψ, i.e., ✸✷ψ → ✷✸✷ψ. The contrapositive of axiom schema 5 is ✸✷ϕ → ✷ϕ, equivalent to ¬✸✷ϕ ∨ ✷ϕ. For ϕ = ψ this is expressed by the formula ¬t✸t✷<sup>ψ</sup> ∨ t✷ψ. For the formula ¬t✸t✷<sup>ψ</sup> → ✷¬t✷ψ, consider ¬✸✷ψ. By duality of ✷ and ✸, this is


**Table 3.** Transformation of ✷-formulae in modal logic L

equivalent to ¬¬✷¬✷ψ and thus to ✷¬✷ψ. So, ¬✸✷ψ → ✷¬✷ψ holds in every normal modal logic, not only in K5. The remaining labelled formulae introduced by PKB and PK5 ensure that the supplementary propositional symbols are defined. For the remaining logics the additional clauses are also based directly on the axiom schemata.

To simplify presentation in the following, we define a function η<sup>f</sup> as follows:

$$\begin{aligned} \eta\_f(\varphi\_1 \wedge \varphi\_2) &= \eta(\varphi\_1) \wedge \eta(\varphi\_2) & \eta\_f(\varphi\_1 \vee \varphi\_2) &= \eta(\varphi\_1) \vee \eta(\varphi\_2) \\ \eta\_f(\Box \varphi) &= \Box \eta(\varphi) & \eta\_f(\Diamond \varphi) &= \Diamond \eta(\varphi) \end{aligned}$$

and we treat the two clauses S : ¬t<sup>ψ</sup>1∧ψ<sup>2</sup> ∨ η(ψ1) and S : ¬t<sup>ψ</sup>1∧ψ<sup>2</sup> ∨ η(ψ2) resulting from the normal form transformation of ψ<sup>1</sup> ∧ ψ<sup>2</sup> as a single 'clause' S : ¬t<sup>ψ</sup>1∧ψ<sup>2</sup> ∨ η<sup>f</sup> (ψ<sup>1</sup> ∧ ψ2). We also interchangeably write S : ¬t✷<sup>ψ</sup> ∨ η<sup>f</sup> (✷ψ) for S : t✷<sup>ψ</sup> → η<sup>f</sup> (✷ψ) and, analogously, S : ¬t✸<sup>ψ</sup> ∨ η<sup>f</sup> (✸ψ) for S : t✸<sup>ψ</sup> → η<sup>f</sup> (✸ψ). We then call any clause of the form S : ¬t<sup>ψ</sup> ∨ η<sup>f</sup> (ψ) a *definitional clause*.

**Definition 1.** *Let* Φ *be a set of* SNFsml *clauses. We say* t<sup>ψ</sup> ∈ T occurs at level ml in Φ *iff either*


**Definition 2.** *Let* Φ *be a set of* SNFsml *clauses. Then* Φ *is* definition-complete *iff for every* t<sup>ψ</sup> ∈ T *and every level* ml*, if* t<sup>ψ</sup> *occurs at level* ml *in* Φ *then there exists a clause* S : ¬t<sup>ψ</sup> ∨ η<sup>f</sup> (ψ) *in* Φ *with* ml ∈ S*.*

**Theorem 2.** *Let* L ∈ {K,KB,KD,KT,K4,K5}*. Then* Φ<sup>L</sup> = {{0} : tϕ} ∪ ρL({0} : t<sup>ϕ</sup> → ϕ) *is definition-complete.*

*Proof.* By induction over the computation of ΦL. It is straightforward to see that the transformation of labelled formulae S : t → (ψ1 ∧ ψ2) and S : t → (ψ1 ∨ ψ2) only introduces surrogates at levels in S, and ΔL then adds definitional clauses for those surrogates. The transformation of a labelled formula S : t✸ψ → ✸ψ may introduce a surrogate at levels in S<sup>+</sup>, and δL(S<sup>+</sup>, ψ) then adds definitional clauses for those surrogates. The transformation of a labelled formula S : t✷ψ → ✷ψ depends on the logic L. We can see that for every level at which a new surrogate occurs in PL(S : t✷ψ → ✷ψ), ΔL(S : t✷ψ → ✷ψ) contains a definitional clause for it at that level.

#### **4 Correctness**

Due to space constraints we only prove the correctness of the transformation for KB. We first state several lemmata that are used in the correctness proofs for all logics.

**Lemma 1.** *Let* Φ *be a set of definitional clauses such that every* t_ψ *occurring in* Φ *is an element of* T *and all other propositional symbols occurring in* Φ *are in* Q*. Let* M = ⟨W, R, V, w₀⟩ *be a rooted Kripke structure. Let* ⟨W′, R′⟩ *be the unravelling of* ⟨W, R⟩ *at* w₀*. Let* M′ = ⟨W′, R′, V^Σ, (w₀)⟩ *be a Kripke structure such that*

**–** V^Σ(p) = {w̄ ∈ W′ | trm(w̄) ∈ V(p)} *for every propositional symbol* p ∈ Q*, and*

**–** V^Σ(t_ψ) = {w̄ ∈ W′ | M′, w̄ ⊨ ψ} *for every surrogate* t_ψ ∈ T ∩ var(Φ)*.*

*Then* M′ ⊨ Φ*.*

**Lemma 2.** *Let* ϕ *be an* L*-satisfiable modal formula in simplified NNF, where* L *is a normal modal logic, and let* Φ = {{0} : t_ϕ} ∪ ρ_K({0} : t_ϕ → ϕ)*. Let* M = ⟨W, R, V, w₀⟩ *be a rooted* K *model of* ϕ*. Let* ⟨W′, R′⟩ *be the unravelling of* ⟨W, R⟩ *at* w₀*. Let* M′ = ⟨W′, R′, V′, (w₀)⟩ *be a Kripke structure such that*


**Lemma 3.** *Let* M = ⟨W, R, V, w₀⟩ *be a rooted Kripke structure. Let* ⟨W′, R′⟩ *be the unravelling of* ⟨W, R⟩ *at* w₀*. Let* M′ = ⟨W′, R′, V^Σ, (w₀)⟩ *where* V^Σ(p) = {w̄ ∈ W′ | trm(w̄) ∈ V(p)} *for every propositional symbol* p ∈ Q*.*

*Then for every modal formula* ψ *over* Q *and for every world* w̄ ∈ W′*,* M′, w̄ ⊨ ψ *iff* M, trm(w̄) ⊨ ψ*.*

**Lemma 4.** *Let* ϕ *be a modal formula in simplified NNF. Let* Φ_K = {{0} : t_ϕ} ∪ ρ_K({0} : t_ϕ → ϕ)*. Let* Φ *with* Φ_K ⊆ Φ *be a definition-complete set of* SNF_sml *clauses, let* M = ⟨W, R, V, w₀⟩ *be a tree* K *model of* Φ *and let* M′ = ⟨W, R′, V, w₀⟩ *be such that*

*(4a)* R ⊆ R′*;*

*Then* M′ ⊨ Φ *and* M′, w₀ ⊨ ϕ*.*

Theorems 3 and 4 now state the correctness of our transformation for KB.

**Theorem 3.** *Let* ϕ *be a modal formula in simplified NNF. Let* Φ_B = {{0} : t_ϕ} ∪ ρ_KB({0} : t_ϕ → ϕ)*. If* ϕ *is* KB*-satisfiable, then* Φ_B *is* K*-satisfiable.*

*Proof.* The main idea is to show that, given a rooted KB model of ϕ, a small variation of its unravelling is a rooted tree K model of Φ_B.

Let M = ⟨W, R, V, w₀⟩ be a rooted KB model of ϕ with M, w₀ ⊨ ϕ and symmetric relation R. Let ⟨W′, R′⟩ be the unravelling of ⟨W, R⟩ at w₀. Let M^B = ⟨W′, R′, V^B, (w₀)⟩ where


Note that V^B is well-defined, as for every surrogate t_ψ ∈ T, ψ only contains propositional symbols in Q. Let Φ_K = {{0} : t_ϕ} ∪ ρ_K({0} : t_ϕ → ϕ).

We now consider the clauses occurring in Φ_B and show that they hold in M^B. By Lemma 2 it follows that M^B ⊨ Φ_K. Also, all definitional clauses in Φ_B \ Φ_K are true in M^B by Lemma 1.

Next consider clauses of the form

$$(1)\ S': \eta(\psi) \lor t\_{\Box \neg t\_{\Box \psi}} \qquad\qquad\text{(2)}\ S': t\_{\Box \neg t\_{\Box \psi}} \to \Box \neg t\_{\Box \psi}.$$

where t_{✷ψ} is a surrogate for ✷ψ. These clauses are not in Φ_K. We show that both are true in M^B, first considering a world at which t_{✷¬t_✷ψ} is true and then one at which it is false.

Case (a): Let w̄ ∈ M^B[S′] with M^B, w̄ ⊨ t_{✷¬t_✷ψ}. Clearly, M^B, w̄ ⊨ η(ψ) ∨ t_{✷¬t_✷ψ}. Also, by definition of M^B, M^B, w̄ ⊨ ✷¬✷ψ. So, for every v̄ ∈ W′ with w̄ R′ v̄, M^B, v̄ ⊨ ¬✷ψ. As t_{✷ψ} is a surrogate for ✷ψ, by definition of V^B, v̄ ∉ V^B(t_{✷ψ}) and M^B, v̄ ⊨ ¬t_{✷ψ}. Thus, M^B, w̄ ⊨ ✷¬t_{✷ψ} and, by the semantics of implication, M^B, w̄ ⊨ t_{✷¬t_✷ψ} → ✷¬t_{✷ψ}.

Case (b): Let w̄ ∈ M^B[S′] with M^B, w̄ ⊭ t_{✷¬t_✷ψ}. Clearly, by the semantics of implication, M^B, w̄ ⊨ t_{✷¬t_✷ψ} → ✷¬t_{✷ψ}. Also, by definition of V^B, w̄ ∉ V^B(t_{✷¬t_✷ψ}) implies M^B, w̄ ⊭ ✷¬✷ψ, which in turn implies M^B, w̄ ⊨ ✸✷ψ. So, there exists v̄ ∈ W′ with w̄ R′ v̄ and M^B, v̄ ⊨ ✷ψ. Since trm is a p-morphism from M^B to M, trm(w̄) R trm(v̄). Since R is symmetric, we also have trm(v̄) R trm(w̄) and, by construction of M^B, for ū = v̄ ∘ trm(w̄) we have v̄ R′ ū. Since M^B, v̄ ⊨ ✷ψ, M^B, ū ⊨ ψ. As trm is a p-morphism, M, trm(ū) ⊨ ψ, and since trm(w̄) = trm(ū), M, trm(w̄) ⊨ ψ. By Lemma 3, from M, trm(w̄) ⊨ ψ we obtain M^B, w̄ ⊨ ψ. If ψ is a literal, then η(ψ) = ψ and M^B, w̄ ⊨ η(ψ). If ψ is not a literal, then η(ψ) = t_ψ and from M^B, w̄ ⊨ ψ, by definition of V^B, w̄ ∈ V^B(t_ψ) and M^B, w̄ ⊨ t_ψ. So, M^B, w̄ ⊨ η(ψ) ∨ t_{✷¬t_✷ψ}.

Thus, in both cases, for arbitrary w̄ ∈ M^B[S′], both η(ψ) ∨ t_{✷¬t_✷ψ} and t_{✷¬t_✷ψ} → ✷¬t_{✷ψ} hold at w̄, and therefore Clauses (1) and (2) are true in M^B.

**Theorem 4.** *Let* ϕ *be a modal formula in simplified NNF. Let* Φ_B = {{0} : t_ϕ} ∪ ρ_KB({0} : t_ϕ → ϕ)*. If* Φ_B *is* K*-satisfiable, then* ϕ *is* KB*-satisfiable.*

*Proof.* The main idea is to show that, given a rooted tree K model of Φ_B, its symmetric closure is a rooted KB model of ϕ.

Let M = ⟨W, R, V, w₀⟩ be a rooted tree K model of Φ_B. Let M^B = ⟨W, R^B, V^B, w₀⟩ be a structure such that


Let Φ_K = {{0} : t_ϕ} ∪ ρ_K({0} : t_ϕ → ϕ). We show that M^B satisfies the three preconditions of Lemma 4. By Lemma 4 this in turn implies that M^B ⊨ ϕ.


Case (a): Assume wRv. As M, w ⊨ t_{✷ψ} and M, w ⊨ t_{✷ψ} → ✷η(ψ), we have M, w ⊨ ✷η(ψ). As wRv, M, v ⊨ η(ψ). As η(ψ) is a literal and V^B = V, we obtain M^B, v ⊨ η(ψ). So, M^B, w ⊨ t_{✷ψ} → ✷η(ψ).

Case (b): Assume v is not reachable from w via R. Then wR^Bv was introduced by the symmetric closure operation on R and we must have vRw. That is, v is an R-predecessor of w, and from w ∈ M[S] it follows that v ∈ M[S⁻]. So, (7) M, v ⊨ η(ψ) ∨ t_{✷¬t_✷ψ} and (8) M, v ⊨ t_{✷¬t_✷ψ} → ✷¬t_{✷ψ}. From vRw, M, w ⊨ t_{✷ψ}, and (8), it follows that M, v ⊨ ¬t_{✷¬t_✷ψ}. This together with (7) implies M, v ⊨ η(ψ). As η(ψ) is a literal and V^B = V, we obtain M^B, v ⊨ η(ψ). So, M^B, w ⊨ t_{✷ψ} → ✷η(ψ).

Case (a) and Case (b) together show that Property (6) holds.

**–** For Condition (4c) let (9) S : t_{✷ψ} → ✷t_ψ be in Φ_B, v, w ∈ W, ml_M(w) = ml ∈ S (i.e., w ∈ M[S]) and w R^B v. We need to show that there exists a clause S′ : ¬t_ψ ∨ η_f(ψ) in Φ_B with v ∈ M[S′].

As in the previous case, w R^B v implies either wRv or vRw. In the first case ml_M(v) = ml + 1, while in the second case ml_M(v) = ml − 1.

As Φ_B contains Clause (9), t_ψ occurs at level ml + 1 in Φ_B. By definition of ρ_KB, Φ_B also contains the clause (10) S⁻ : t_ψ ∨ t_{✷¬t_✷ψ}. As ml ∈ S, ml − 1 ∈ S⁻ and therefore t_ψ also occurs at level ml − 1 in Φ_B. By Theorem 2, Φ_B is definition-complete, so there must be a clause S′ : ¬t_ψ ∨ η_f(ψ) in Φ_B such that ml + 1 and ml − 1 are in S′.

**Theorem 5.** *Let* ϕ *be a modal formula in simplified NNF,* L ∈ {K, KB, KD, KT, K4, K5}*, and* Φ_L = {{0} : t_ϕ} ∪ ρ_L({0} : t_ϕ → ϕ)*. Then* ϕ *is* L*-satisfiable iff* Φ_L *is* K*-satisfiable.*

## **5 Comparison With Related Work**

The approaches most closely related to ours are Kracht's reductions of normal modal logics to basic modal logic [11,12], the global modal resolution calculus [14], and Schmidt and Hustadt's axiomatic translation principle for translations of normal modal logics to first-order logic [24].

The first significant difference to our approach is that Kracht's reductions and the axiomatic translation exclude the modal operator ✸ from the language and only consider the modal operator ✷.

In order to present Kracht's approach, we need some additional notions. Let sf(ϕ), dg(ϕ), and |S| denote the set of all subformulae of ϕ, the maximum nesting of modal operators in ϕ, and the cardinality of the set S, respectively. Let ✸⁰ψ = ✷⁰ψ = ✷^{<1}ψ = ψ, ✷^{<n+1}ψ = (ψ ∧ ✷✷^{<n}ψ), ✷^{n+1}ψ = ✷✷ⁿψ, and ✸^{n+1}ψ = ✸✸ⁿψ. We can then define a reduction function ρ^K_L for a normal modal logic L in {KB, KD, KT, K4} as follows:

$$\rho\_L^{\mathsf{K}}(\varphi) = \begin{cases} \varphi \wedge \Box^{<|\mathsf{sf}(\varphi)|+1} P\_{\mathsf{K4}}^{\mathsf{K}}(\varphi), & \text{for } L = \mathsf{K4} \\ \varphi \wedge \Box^{<\mathsf{dg}(\varphi)+1} P\_L^{\mathsf{K}}(\varphi), & \text{otherwise} \end{cases}$$

where

$$\begin{aligned} P\_{\mathsf{KB}}^{\mathsf{K}}(\varphi) &= \{\neg\psi \to \Box\neg\Box\psi \mid \Box\psi \in \mathsf{sf}(\varphi)\} & P\_{\mathsf{KD}}^{\mathsf{K}}(\varphi) &= \{\neg\Box\mathbf{false}\} \\ P\_{\mathsf{K4}}^{\mathsf{K}}(\varphi) &= \{\Box\psi \to \Box\Box\psi \mid \Box\psi \in \mathsf{sf}(\varphi)\} & P\_{\mathsf{KT}}^{\mathsf{K}}(\varphi) &= \{\Box\psi \to \psi \mid \Box\psi \in \mathsf{sf}(\varphi)\} \end{aligned}$$

Kracht shows that ϕ is L-satisfiable iff ρ^K_L(ϕ) is K-satisfiable. There are three differences to our approach. First, P^K_L(ϕ) will include an axiom instance for every occurrence of a subformula ¬✷ψ, equivalent to ✸¬ψ, in ϕ. In contrast, our approach requires no logic-specific treatment of such subformulae. Second, the use of ✷^{<n}P^K_L(ϕ) in ρ^K_L means that the axiom instances are available at every modal level. This means, for example, that for ϑ₁ = ✸¹⁰⁰(¬p ∧ ✷p), the formula ρ^K_KT(ϑ₁) contains the axiom instance ✷p → p over 100 times, although it is only required at the level at which ✷p occurs. Third, this is further compounded if the formula ψ in ✷ψ is itself a complex formula. We try to avoid that by using a surrogate propositional symbol t_ψ instead, but this will only have a positive effect if the definitional clauses for t_ψ do not have to be repeated.
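To illustrate the repetition, consider (our illustrative example, with ✸ expressed as ¬✷¬ since Kracht's language has no ✸) the formula ϕ = ✷p ∧ ¬✷¬q, for which dg(ϕ) = 1 and P^K_KT(ϕ) = {✷p → p, ✷¬q → ¬q}. Writing χ for the conjunction of these two instances,

$$\rho\_{\mathsf{KT}}^{\mathsf{K}}(\varphi) = \varphi \wedge \Box^{<2}\chi = \varphi \wedge \chi \wedge \Box\chi,$$

so both axiom instances are asserted at modal levels 0 and 1, even though ✷p and ✷¬q each occur at a single level only.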

The global modal resolution (GMR) calculus operates on SNF<sup>K</sup> clauses, that is, clauses of the form

$$
\Box^\*(\mathbf{start} \to \bigvee\_{b=1}^r l\_b) \quad \Box^\*(\mathbf{true} \to \bigvee\_{b=1}^r l\_b) \quad \Box^\*(l' \to \Box l) \quad \Box^\*(l' \to \neg\Box\neg l)
$$


**Table 4.** Inference rules in [14] for K5 (EUC1 and EUC2).

where l, l′, l_b are propositional literals with 1 ≤ b ≤ r, r ∈ ℕ, and ✷* is the universal operator. The calculus has specific inference rules for normal modal logics such as KB, KD, KT, K4, and K5. Table 4 shows the two additional rules for K5, the only logic for which there are rules for both ✷ and ¬✷¬, i.e., ✸. These inference rules can be seen to perform an 'on-the-fly' computation of a transformation. Note that the clauses produced by P_K5 differ from those produced by GMR for K5. Implicitly, our results here also show that it should be possible to eliminate EUC1 from the GMR calculus.

For the axiomatic translation, we only present the function P^RS_L that computes the logic-dependent first-order clausal formulae that are part of the overall translation.

$$\begin{split} P^{\mathsf{RS}}\_{\mathsf{KB}}(\Box\psi) &= \{ \forall xy(\neg Q\_{\Box\psi}(y) \lor \neg R(x,y) \lor Q\_{\psi}(x)) \mid \Box\psi \in \mathsf{sf}(\varphi) \} \\ P^{\mathsf{RS}}\_{\mathsf{KD}}(\Box\psi) &= \{ \forall x(\neg Q\_{\Box\psi}(x) \lor Q\_{\neg\Box\neg\psi}(x)) \mid \Box\psi \in \mathsf{sf}(\varphi) \} \\ P^{\mathsf{RS}}\_{\mathsf{KT}}(\Box\psi) &= \{ \forall x(\neg Q\_{\Box\psi}(x) \lor Q\_{\psi}(x)) \mid \Box\psi \in \mathsf{sf}(\varphi) \} \\ P^{\mathsf{RS}}\_{\mathsf{K4}}(\Box\psi) &= \{ \forall xy(\neg Q\_{\Box\psi}(x) \lor \neg R(x,y) \lor Q\_{\Box\psi}(y)) \mid \Box\psi \in \mathsf{sf}(\varphi) \} \\ P^{\mathsf{RS}}\_{\mathsf{K5}}(\Box\psi) &= \{ \forall xy(\neg Q\_{\Box\psi}(y) \lor \neg R(x,y) \lor Q\_{\Box\psi}(x)), \\ &\qquad \forall xy(\neg Q\_{\Box\neg\Box\psi}(y) \lor \neg R(x,y) \lor Q\_{\Box\neg\Box\psi}(x)) \mid \Box\psi \in \mathsf{sf}(\varphi) \} \end{split}$$

The predicate symbols Q_ψ correspond to our surrogate symbols t_ψ. The clausal formulae used in the treatment of KT and K4 are translations of the SNF_ml clauses we use (or vice versa). KB and K5 are handled in a different way, as the first-order clausal formulae refer directly to the accessibility relation and can therefore more easily express the transfer of information to a predecessor world. The universal quantification over worlds also means that the constraints expressed by the formulae hold at all modal levels without the need for any repetition.

In Section 6 we will also use the relational and semi-functional translations of modal logics to first-order logic combined with structural transformation to clause normal form. In both approaches ✷ψ is translated as ∀xy(¬Q_{✷ψ}(x) ∨ ¬R(x, y) ∨ Q_ψ(y)), while ✸ψ becomes ∀x∃y(¬Q_{✸ψ}(x) ∨ R(x, y)) and ∀x∃α(¬Q_{✸ψ}(x) ∨ R(x, [xα])) in the relational and semi-functional translation, respectively. Then, depending on the modal logic, further formulae representing the semantic properties of the accessibility relation R are added. For the relational translation these are simply the formulae in the fourth column of Table 1. The semi-functional translation uses collections of partial accessibility functions in addition to the accessibility relation. A predicate def is used to represent on which worlds a partial accessibility function is defined. For each modal logic there is then again a background theory, consisting of formulae over def and R, that represents the properties of the underlying accessibility relation and is added to the translation of a formula. For example, for K5 the background theory is ∀xy∀αβ((¬def(x) ∨ def(y)) ∧ (¬def(w₀) ∨ R(w₀, [w₀α])) ∧ (¬def(x) ∨ ¬def(y) ∨ R([xα], [yβ]))), where w₀ is a constant representing the root world in a rooted Kripke structure.

## **6 Evaluation**

We have compared the performance of the following approaches: (i) the combination of our reductions with the modal-layered resolution (MLR) calculus for SNF_ml clauses [15] implemented in the modal theorem prover KSP, with three different refinements for resolution inferences on labelled propositional clauses; (ii) the global modal resolution (GMR) calculus, also implemented in KSP, with three different refinements for resolution inferences on propositional clauses; (iii) the combinations of the relational and semi-functional translations of modal logics to first-order logic with ordered first-order resolution implemented in the first-order theorem prover SPASS. In total this gives us eight different approaches to compare. The axiomatic translation is currently not implemented in SPASS. Other provers, such as LEO-III [26], LWB [9], and MleanCoP [21], do not have built-in support for the full range of logics considered here. LoTREC 2.0 [7] supports all the logics, but is not intended as an automatic theorem prover.

The modal-layered resolution calculus operates on SNFml clauses, that is, clauses of the form

$$ml: \bigvee\_{b=1}^r l\_b \qquad ml: l' \to \Box l \qquad ml: l' \to \Diamond l$$

where ml ∈ ℕ ∪ {∗} and l, l′, l_b are propositional literals with 1 ≤ b ≤ r, r ∈ ℕ. In the implementation of the reductions presented in Section 3, we take an SNF_sml clause S : ψ simply as an abbreviation of the set of SNF_ml clauses {ml : ψ | ml ∈ S}. Note that this also means that we will have to repeat similar resolution inferences for different modal levels.
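As a tiny sketch (our naming, not KSP's actual implementation), the abbreviation can be read as:

```python
# Expand a labelled SNF_sml clause S : psi into its SNF_ml instances
# {ml : psi | ml in S}; resolution then works level by level, which is
# why similar inferences recur at different modal levels.
def expand_snf_sml(levels, clause):
    return {(ml, clause) for ml in levels}

print(sorted(expand_snf_sml({0, 1, 2}, "~t | p")))
```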

KSP [13] implements the reductions presented in Section 3 as well as a normal form transformation of modal formulae to sets of SNF_K clauses. It implements both the MLR and the GMR calculus. Resolution inferences between (labelled) propositional clauses can either be unrestricted (cplain option); restricted by an ordering (cord option), that is, clauses can only be resolved on their maximal literals with respect to an ordering chosen by the prover in such a way as to preserve completeness; restricted to negative resolution (cneg option), that is, one of the premises in an inference has to be a negative clause; or restricted to positive resolution. We do not include the last option in our evaluation as it typically performs worse. KSP also implements a range of simplification rules that are applied to modal formulae before their transformation to normal form. Of those we have enabled pure literal elimination (early ple option) and simplification using the Box Normal Form [22] and Prenex Normal Form (bnfsimp and prenex options) [17]. For clause processing, unit resolution and pure literal elimination are enabled (unit, lhs unit, and ple options).

**Table 5.** Experimental results on the LWB benchmark collection

SPASS 3.9 [27,28] supports automated reasoning in extended modal logics, including all logics considered here, PDL-like modal logics as well as description logics. It includes eight different translations of modal logics to first-order logic. In our evaluation we have used the relational translation and the semi-functional translation. For the local satisfiability problem in KB to K5, for the relational translation we have added the first-order frame properties given in Table 1, while for the semi-functional translation we have added the background theories devised by Nonnengart [20]. For the transformation to first-order clausal form, we have enabled renaming of quantified subformulae. The only inference rules used are ordered resolution and ordered factoring; the reduction rules used are condensing, backward subsumption, and forward subsumption. For the relational and semi-functional translations for K, KB, KD, and KT we thereby obtain a decision procedure, while for the other logics we do not. For K4 and K5, the fragment of first-order clausal logic corresponding to the semi-functional translation of modal formulae and their background theories is decidable by ordered resolution with selection [25]. However, the non-trivial ordering and selection function required is not currently implemented in SPASS.

For our evaluation we have chosen the LWB basic modal logic benchmark collection [2], with 20 formulae in each of 18 parameterised classes. For K, all formulae in 9 classes are satisfiable while all formulae in the other 9 classes are unsatisfiable. In their negation normal form, 63% of the modal operators are ✷ and 37% are ✸. We have used the collection for each of the six logics. If a formula is unsatisfiable in K then it remains unsatisfiable in the other five logics, while the converse does not hold. As we move to logics other than K, it is also no longer the case that all formulae in a class have the same satisfiability status.

The third column in Table 5 indicates the total number of satisfiable and unsatisfiable formulae for each logic. In the last two lines of the table we sum up the results for all logics. The last eight columns in the table show how many formulae each of the approaches was able to solve within a time limit of 100 CPU seconds per formula. Benchmarking was performed on a PC with an AMD Ryzen 5 5600X CPU @ 4.60GHz max and 32GB main memory using Fedora release 33 as the operating system.

As we can see, the new reductions combined with the modal-layered resolution (MLR) calculus and the ordered resolution refinement (cord) perform best, achieving the highest number of solved formulae in 8 out of 12 individual categories in the table, on two of those equal with the global modal resolution (GMR) calculus. On 3 categories, GMR outperforms MLR. On both satisfiable and unsatisfiable formulae in K5, this can be seen as evidence that the 'on-the-fly' transformation offers a (slight) advantage over our approach, given that the additional clauses hold universally in both approaches. For SPASS we see a clear advantage of the semi-functional translation over the relational one, on both satisfiable and unsatisfiable formulae.

# **7 Conclusion and Future Work**

We have presented new reductions of propositional modal logics KB, KD, KT, K4, K5 to Separated Normal Form with Sets of Modal Levels. We have shown experimentally that these reductions allow us to reason effectively in these logics.

The obvious next step is to consider extensions of the basic modal logic K with combinations of the axioms B, D, T, 4, and 5. Unfortunately, a simple combination of the reductions for each of the axioms is not sufficient to obtain a satisfiability-preserving reduction for such modal logics. An example is the simple formula ¬p ∧ ✸✸✷p, which is KB4-unsatisfiable. If we define

$$\begin{aligned} P\_{\mathsf{KB4}}(S:t\_{\Box\psi}\to\Box\psi) &= P\_{\mathsf{KB}}(S:t\_{\Box\psi}\to\Box\psi)\cup P\_{\mathsf{K4}}(S:t\_{\Box\psi}\to\Box\psi),\\ \Delta\_{\mathsf{KB4}}(S:t\_{\Box\psi}\to\Box\psi) &= \delta\_{\mathsf{KB4}}(\star,\psi),\end{aligned}$$

that is, P_KB4 is the union of P_KB and P_K4, then the clause set obtained from {{0} : t₀} ∪ ρ_KB4({0} : t₀ → ¬p ∧ ✸✸✷p) is K-satisfiable. The same issue also occurs in the axiomatic translation of modal logics to first-order logic, where the translation for KB4 is not simply the combination of the translations for KB and K4 [24, Theorem 5.6]. We are currently exploring solutions to this problem.

Regarding practical applications, it would be advantageous to have an implementation of a calculus that operates directly on SNF_sml clauses. This would greatly reduce the number of inference steps performed on satisfiable formulae and simplify proof search in general. Again, such an implementation is future work.

#### **References**



#### **Isabelle's Metalogic: Formalization and Proof Checker**⋆

Tobias Nipkow and Simon Roßkopf

Technical University of Munich, Munich, Germany

**Abstract.** Isabelle is a generic theorem prover with a fragment of higher-order logic as a metalogic for defining object logics. Isabelle also provides proof terms. We formalize this metalogic and the language of proof terms in Isabelle/HOL, define an executable (but inefficient) proof term checker, and prove its correctness w.r.t. the metalogic. We integrate the proof checker with Isabelle and run it on a range of logics and theories to check the correctness of all the proofs in those theories.

#### **1 Introduction**

One of the selling points of proof assistants is their trustworthiness. Yet in practice soundness problems do come up in most proof assistants. Harrison [11] distinguishes errors in the logic and errors in the implementation (and cites examples). Our work contributes to the solution of both problems for the proof assistant Isabelle [31]. Isabelle is a generic theorem prover: it implements M, a fragment of intuitionistic higher-order logic, as a metalogic for defining object logics. Its most developed object logic is HOL and the resulting proof assistant is called Isabelle/HOL [25,24]. The latter is the basis for our formalizations.

Our first contribution is the first complete formalization of Isabelle's metalogic. Thus our work applies to all Isabelle object logics, e.g. not just HOL but also ZF. Of course Paulson [30] describes M precisely, but only on paper. More importantly, his description does not cover polymorphism and type classes, which were introduced later [26]. The published account of Isabelle's proof terms [4] is also silent about type classes. Yet type classes are a significant complication (as, for example, Kunčar and Popescu [18] found out).

Our second contribution is a verified (against M) and executable checker for Isabelle's proof terms. We have integrated the proof checker with Isabelle. Thus we can guarantee that every theorem whose proof our proof checker accepts is provable in our definition of M. So far we are able to check the correctness of moderately sized theories across the full range of logics implemented in Isabelle.

Although Isabelle follows the LCF architecture (theorems can only be manufactured by inference rules), it is based on an infrastructure optimized for performance. In particular, this includes multithreading, which is used in the kernel and has once led to a soundness issue<sup>1</sup>. Therefore we opt for the "certificate checking" approach (via proof terms) instead of verifying the implementation.

⋆ Supported by Wirtschaftsministerium Bayern under DIK-2002-0027//DIK0185/03 and DFG GRK 2428 ConVeY

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 93–110, 2021. https://doi.org/10.1007/978-3-030-79876-5\_6

This is the first work that deals directly with what is implemented in Isabelle as opposed to a study of the metalogic that Isabelle is meant to implement. Instead of reading the implementation you can now read and build on the more abstract formalization in this paper. The correspondence of the two can be established for each proof by running the proof checker.

Our formalization reflects the ML implementation of Isabelle's terms and types and some other data structures. Thus a few implementation choices are visible, e.g. De Bruijn indices. This is necessary because we want to integrate our proof checker as directly as possible with Isabelle, with as little unverified glue code as possible, for example no translation between De Bruijn indices and named variables. We refer to this as our *intentional implementation bias*. In principle, however, one could extend our formalization with different representations (e.g. named terms) and prove suitable isomorphisms. Our work is purely proof theoretic; semantics is out of scope.

The formalization can be found in the Archive of Formal Proofs [28].

#### **2 Related Work**

Harrison [11] was the first to verify some of HOL's metatheory and an implementation of a HOL kernel in HOL itself. Kumar *et al.* [13] formalized HOL including definition principles, proved its soundness and synthesized a verified kernel of a HOL prover down to the machine language level. Abrahamsson [2] verified a proof checker for the OpenTheory [12] proof exchange format for HOL.

Wenzel [38] showed how to interpret type classes as predicates on types. We follow his approach of reflecting type classes in the logic but cannot remove them completely because of our intentional implementation bias (see above). Kunčar and Popescu [15,16,17,18] focus on the subtleties of definition principles for HOL with overloading and prove that, under certain conditions, type and constant definitions preserve consistency. Åman Pohjola *et al.* [1] formalize [15,18].

Adams [3] presents HOL Zero, a basic theorem prover for HOL that addresses the problem of how to ensure that parser and pretty-printer do not misrepresent formulas.

Let us now move away from Isabelle and HOL. Sozeau *et al.* [36] present the first implementation of a type checker for the kernel of Coq that is proved correct in Coq with respect to a formal specification. Carneiro [6] has implemented a highly performant proof checker for a multi-sorted first-order logic and is in the process of verifying it in its own logic.

We formalize a logic with bound variables, and there is a large body of related work that deals with this issue (e.g. [37,21,7]) and a range of logics and systems with special support for handling bound variables (e.g. [33,34,35]). We found that De Bruijn indices worked reasonably well for us.

<sup>1</sup> https://mailmanbroy.in.tum.de/pipermail/isabelle-dev/2016-December/007251.html

## **3 Preliminaries**

Isabelle types are built from type variables, e.g. *'a*, and (postfix) type constructors, e.g. *'a list*; the function type arrow is ⇒. Isabelle also has a type class system explained later. The notation t :: τ means that term t has type τ. Isabelle/HOL provides types *'a set* and *'a list* of sets and lists of elements of type *'a*. They come with the following vocabulary: function set (conversion from lists to sets), (#) (list constructor), (@) (append), |*xs*| (length of list *xs*), *xs* ! *i* (the *i*th element of *xs* starting at 0), list-all2 *p* [*x*₁, . . ., *x*ₘ] [*y*₁, . . ., *y*ₙ] = (*m* = *n* ∧ *p x*₁ *y*₁ ∧ ... ∧ *p x*ₙ *y*ₙ), and other self-explanatory notation.
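For readers unfamiliar with the notation, list-all2 behaves like the following Python sketch (our rendering, not part of the formalization):

```python
# list_all2 p xs ys: xs and ys have the same length and p holds pairwise.
def list_all2(p, xs, ys):
    return len(xs) == len(ys) and all(p(x, y) for x, y in zip(xs, ys))

print(list_all2(lambda x, y: x < y, [1, 2], [3, 4]))  # True: 1 < 3 and 2 < 4
```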

The Field of a relation *r* is the set of all *x* such that (*x*, *y*) ∈ *r* or (*y*, *x*) ∈ *r* for some *y*.

There is also the predefined data type

**datatype** *'a option* = None | Some *'a*

The type τ₁ ⇀ τ₂ abbreviates τ₁ ⇒ τ₂ *option*, i.e. partial functions, which we call *maps*. Maps have a domain and a range:

dom *m* = {*a* | *m a* ≠ None} ran *m* = {*b* | ∃ *a*. *m a* = Some *b*}.

Logical equivalence is written = instead of ←→.

## **4 Types and Terms**

A *name* is simply a string. Variables have type *var* ; their inner structure is immaterial for the presentation of the logic.

The logic has three layers: terms are classified by types as usual, but in addition types are classified by *sorts*. A *sort* is simply a set of class names. We discuss sorts in detail later.

Types (typically denoted by *T*, *U*, . . . ) are defined like this:

**datatype** *typ* = Ty *name* (*typ list*) | Tv *var sort*

where Ty κ [*T*<sub>1</sub>, . . ., *T*<sub>n</sub>] represents the Isabelle type (*T*<sub>1</sub>, . . ., *T*<sub>n</sub>) κ and Tv *a S* represents a type variable *a* of sort *S* — sorts are directly attached to type variables. The notation *T* → *U* is short for Ty *"fun"* [*T*, *U*], where *"fun"* is the name of the function type constructor.

Isabelle's terms are simply typed lambda terms in De Bruijn notation:

**datatype** *term* = Ct *name typ* | Fv *var typ* | Bv *nat* | Abs *typ term* | (*·*) *term term*

A term (typically *r*, *s*, *t*, *u*, . . . ) can be a typed constant Ct *c T* or free variable Fv *v T*, a bound variable Bv *n* (a De Bruijn index), a typed abstraction Abs *T t*, or an application *t · u*.

The term-has-type proposition has the syntax *Ts* ⊢<sub>τ</sub> *t* : *T* where *Ts* is a list of types, the context for the types of the bound variables.

$$\frac{}{Ts \vdash_{\tau} \mathsf{Ct}\; c\; T : T} \qquad \frac{}{Ts \vdash_{\tau} \mathsf{Fv}\; v\; T : T} \qquad \frac{i < |Ts|}{Ts \vdash_{\tau} \mathsf{Bv}\; i : Ts \;!\; i}$$

$$\frac{T \;\#\; Ts \vdash_{\tau} t : T'}{Ts \vdash_{\tau} \mathsf{Abs}\; T\; t : T \to T'} \qquad \frac{Ts \vdash_{\tau} u : U \qquad Ts \vdash_{\tau} t : U \to T}{Ts \vdash_{\tau} t \cdot u : T}$$

We define ⊢<sub>τ</sub> *t* : *T* = ([] ⊢<sub>τ</sub> *t* : *T*).
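The typing rules above translate directly into a recursive type-computation function. The following Python sketch is purely illustrative (the paper's formalization is in Isabelle/HOL): terms and types are encoded as tagged tuples, and all names here are our own invention, not taken from the formalization.

```python
# Illustrative tuple encoding, not the paper's code.
# Types: ('Ty', name, [args]) or ('Tv', name, sort); terms: ('Ct', c, T),
# ('Fv', v, T), ('Bv', i), ('Abs', T, body), ('App', f, u).

def fun_type(T, U):
    # T -> U is sugar for Ty "fun" [T, U]
    return ('Ty', 'fun', [T, U])

def typ_of(Ts, t):
    """Compute the type of t in context Ts (types of enclosing binders),
    returning None if no typing rule applies."""
    tag = t[0]
    if tag in ('Ct', 'Fv'):          # constants and free variables carry their type
        return t[2]
    if tag == 'Bv':                  # Bv i is typed by the i-th context entry
        return Ts[t[1]] if t[1] < len(Ts) else None
    if tag == 'Abs':                 # Abs T body : T -> T' if body : T' under T # Ts
        T, body = t[1], t[2]
        Tp = typ_of([T] + Ts, body)
        return fun_type(T, Tp) if Tp is not None else None
    if tag == 'App':                 # t . u : T if t : U -> T and u : U
        Tf, Tu = typ_of(Ts, t[1]), typ_of(Ts, t[2])
        if Tf is not None and Tf[:2] == ('Ty', 'fun') and Tf[2][0] == Tu:
            return Tf[2][1]
    return None

# The identity abstraction Abs T (Bv 0) gets type T -> T in the empty context.
prop = ('Ty', 'prop', [])
assert typ_of([], ('Abs', prop, ('Bv', 0))) == fun_type(prop, prop)
```

Returning an option-like `None` on failure mirrors the partiality of the judgment: a dangling Bv index or an ill-typed application simply has no type.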

Function fv :: *term* ⇒ (*var* × *typ*) *set* collects the free variables in a term. Because bound variables are indices, fv *t* is simply the set of all (*v*, *T*) such that Fv *v T* occurs in *t*. The type is an integral part of a variable.

A *type substitution* is a function ϱ of type *var* ⇒ *sort* ⇒ *typ*. It assigns a type to each pair of a type variable and a sort. We write ϱ \$\$ *T* or ϱ \$\$ *t* for the overloaded function which applies such a type substitution to all type variables (and their sorts) occurring in a type or term. The *type instance* relation is defined like this:

*T*<sub>1</sub> ≲ *T*<sub>2</sub> = (∃ ϱ. ϱ \$\$ *T*<sub>2</sub> = *T*<sub>1</sub>)

We also need to β-contract a term (Abs *T t*) *· u* to something like "*t* with Bv *0* replaced by *u*". We define a function subst-bv such that subst-bv *u t* is that β-contractum. The definition of subst-bv is shown in the Appendix and can also be found in the literature (e.g. [23]).

In order to abstract over a free (term) variable there is a function bind-fv (*v*, *T*) *t* that (roughly speaking) replaces all occurrences of Fv *v T* in *t* by Bv *0*. Again, see the Appendix for the definition. This produces (if Fv *v T* occurs in *t*) a term with an unbound Bv *0*. Function Abs-fv binds it with an abstraction:

Abs-fv *v T t* = Abs *T* (bind-fv (*v*, *T*) *t*)
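Under the same illustrative tuple encoding as before (terms as ('Bv', i), ('Abs', T, t), ('App', f, u), ('Fv', v, T); names are ours, not the formalization's), the appendix definitions of subst-bv, lift and bind-fv can be sketched as:

```python
def lift(t, n):
    # shift De Bruijn indices >= n up by one (used when pushing a term under a binder)
    if t[0] == 'Bv':
        return ('Bv', t[1] + 1) if t[1] >= n else t
    if t[0] == 'Abs':
        return ('Abs', t[1], lift(t[2], n + 1))
    if t[0] == 'App':
        return ('App', lift(t[1], n), lift(t[2], n))
    return t                                  # Ct/Fv unchanged

def subst_bv2(t, n, u):
    # replace Bv n by u, decrementing indices above n
    if t[0] == 'Bv':
        i = t[1]
        return t if i < n else (u if i == n else ('Bv', i - 1))
    if t[0] == 'Abs':
        return ('Abs', t[1], subst_bv2(t[2], n + 1, lift(u, 0)))
    if t[0] == 'App':
        return ('App', subst_bv2(t[1], n, u), subst_bv2(t[2], n, u))
    return t

def subst_bv(u, t):
    # the beta-contractum of (Abs T t) . u
    return subst_bv2(t, 0, u)

def bind_fv2(var, n, t):
    # replace the free variable var = (v, T) by Bv n
    if t[0] == 'Fv':
        return ('Bv', n) if (t[1], t[2]) == var else t
    if t[0] == 'Abs':
        return ('Abs', t[1], bind_fv2(var, n + 1, t[2]))
    if t[0] == 'App':
        return ('App', bind_fv2(var, n, t[1]), bind_fv2(var, n, t[2]))
    return t

def abs_fv(v, T, t):
    # Abs-fv v T t = Abs T (bind-fv (v, T) t)
    return ('Abs', T, bind_fv2((v, T), 0, t))

prop = ('Ty', 'prop', [])
x = ('Fv', 'x', prop)
assert subst_bv(x, ('Bv', 0)) == x                      # beta-contract (Abs prop (Bv 0)) . x
assert abs_fv('x', prop, x) == ('Abs', prop, ('Bv', 0)) # rebind the free variable x
```

The `lift` call in the abstraction case is the standard De Bruijn bookkeeping: the substituend's own indices must be shifted each time it crosses a binder.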

While this section described the syntax of types and terms, they are not necessarily wellformed and should be considered pretypes/preterms. The wellformedness checks are described later.

#### **5 Classes and Sorts**

Isabelle has a built-in system of type classes [22] as in Haskell 98, except that class constraints are directly attached to variable names: our Tv *a* {*C*, *D*, . . .} corresponds to Haskell's (C a, D a, ...) => ... a ....

A *sort* is Isabelle's terminology for a set of (class) names, e.g. {*C*, *D*, . . .}, which represents a conjunction of class constraints. In our work, variables *S*, *S*′ etc. stand for sorts.

Apart from the usual application in object logics, type classes also serve an important metalogical purpose: they allow us to restrict, for example, quantification in object logics to object-level types and rule out meta-level propositions.

Isabelle's type class system was first presented in a programming language context [29,27]. We give the first machine-checked formalization. The central data structure is a so-called *order-sorted signature*. Intuitively, it comprises a set of class names, a partial subclass ordering on them and a set of *type constructor signatures*. A type constructor signature κ :: (*S*<sub>1</sub>, . . ., *S*<sub>k</sub>) *c* for a type constructor κ states that applying κ to types *T*<sub>1</sub>, . . ., *T*<sub>k</sub> such that *T*<sub>i</sub> has sort *S*<sub>i</sub> (defined below) produces a type of class *c*. Formally:

**type synonym** *osig* = ((*name* × *name*) *set* × (*name* ⇀ (*class* ⇀ *sort list*)))

To explain this formalization we start from a pair (*sub*,*tcs*) :: *osig* and recover the informal order-sorted signature described above. The set of classes is simply the Field of the *sub* relation. The *tcs* component represents the set of all type constructor signatures κ :: (*Ss*) *c* (where *Ss* is a list of sorts) such that *tcs* κ = Some *dm* and *dm c* = Some *Ss*. Representing κ :: (*Ss*) *c* as a triple, we define

$$TCS = \{ (\kappa, Ss, c) \mid \exists \, domf. \; tcs \; \kappa = \mathsf{Some} \; domf \land \; domf \; c = \mathsf{Some} \; Ss \} $$

*TCS* is the translation of *tcs*, a data structure close to the implementation, into an equivalent but more intuitive version that is close to the informal presentations in the literature.

The subclass ordering *sub* can be extended to a subsort ordering as follows:

$$S_1 \le_{sub} S_2 = (\forall\, c_2 \in S_2.\; \exists\, c_1 \in S_1.\; c_1 \le_{sub} c_2)$$

The smaller sort needs to subsume all the classes in the larger sort. In particular {*c*1} ≤*sub* {*c*2} iff (*c*1, *c*2) ∈ *sub*.
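The subsort check is directly executable. This Python sketch (illustrative only) assumes *sub* is given as a set of subclass pairs that is already transitively closed on its field, as wf-osig demands; the class names in the example are invented:

```python
def leq_sub(sub, S1, S2):
    """S1 <=sub S2 iff every class in S2 is subsumed by some class in S1.
    sub is a set of (subclass, superclass) pairs, assumed transitively closed."""
    return all(any(c1 == c2 or (c1, c2) in sub for c1 in S1) for c2 in S2)

# Invented example hierarchy: linorder < order < preorder (transitively closed).
sub = {('linorder', 'order'), ('order', 'preorder'), ('linorder', 'preorder')}
assert leq_sub(sub, {'linorder'}, {'order', 'preorder'})  # smaller sort subsumes both
assert not leq_sub(sub, {'preorder'}, {'order'})          # preorder does not imply order
```

Note that the empty sort is the weakest constraint: every sort is ≤<sub>*sub*</sub> the empty sort, since the universal quantifier over its classes is vacuous.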

Now we can define a predicate has-sort that checks whether, in the context of some order-sorted signature (*sub*,*tcs*), a type fulfills a given sort constraint:

$$\frac{S' \le_{sub} S}{\mathsf{has\text{-}sort}\;(sub,\,tcs)\;(\mathsf{Tv}\;a\;S')\;S}$$

$$\frac{tcs\;\kappa = \mathsf{Some}\;dm \qquad \forall\, c \in S.\; \exists\, Ss.\; dm\;c = \mathsf{Some}\;Ss \,\wedge\, \mathsf{list\text{-}all2}\;(\mathsf{has\text{-}sort}\;(sub,\,tcs))\;Ts\;Ss}{\mathsf{has\text{-}sort}\;(sub,\,tcs)\;(\mathsf{Ty}\;\kappa\;Ts)\;S}$$

The rule for type variables uses the subsort relation and is obvious. A type (*T*<sub>1</sub>, . . ., *T*<sub>n</sub>) κ has sort {*c*<sub>1</sub>, ...} if for every *c*<sub>i</sub> there is a signature κ :: (*S*<sub>1</sub>, . . ., *S*<sub>n</sub>) *c*<sub>i</sub> such that has-sort (*sub*, *tcs*) *T*<sub>j</sub> *S*<sub>j</sub> holds for *j* = 1, . . ., *n*.
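The two has-sort rules likewise yield a recursive checker. In this hedged Python sketch, *tcs* is a dict from constructor names to dicts from classes to argument-sort lists, and leq_sub is the subsort test (repeated so the sketch is self-contained); all concrete names are invented:

```python
def leq_sub(sub, S1, S2):
    # subsort test: every class in S2 is subsumed by some class in S1
    return all(any(c1 == c2 or (c1, c2) in sub for c1 in S1) for c2 in S2)

def has_sort(sub, tcs, T, S):
    """has-sort (sub, tcs) T S for T = ('Tv', a, S') or ('Ty', kappa, Ts)."""
    if T[0] == 'Tv':                         # type variable rule: S' <=sub S
        return leq_sub(sub, T[2], S)
    _, kappa, Ts = T
    dm = tcs.get(kappa)                      # tcs kappa = Some dm required
    if dm is None:
        return False
    for c in S:                              # for every class c in S ...
        Ss = dm.get(c)                       # ... find a signature kappa :: (Ss) c
        if Ss is None or len(Ss) != len(Ts):
            return False
        # ... whose argument sorts hold recursively (list-all2)
        if not all(has_sort(sub, tcs, Tj, Sj) for Tj, Sj in zip(Ts, Ss)):
            return False
    return True

# Invented example: lists over an 'order' type are again 'order'.
sub = {('linorder', 'order')}
tcs = {'list': {'order': [{'order'}]}}
a_lin = ('Tv', 'a', {'linorder'})
assert has_sort(sub, tcs, ('Ty', 'list', [a_lin]), {'order'})
assert not has_sort(sub, tcs, ('Ty', 'list', [a_lin]), {'linorder'})
```

The second assertion fails because no signature produces class *linorder* for lists, even though the element type is a linear order.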

We *normalize* a sort by removing "superfluous" class constraints, i.e. retaining only those classes that are not subsumed by other classes. This gives us unique representatives for sorts which we call *normalized*:

normalize-sort *sub S* = {*c* ∈ *S* | ¬ (∃ *c*′ ∈ *S*. (*c*′, *c*) ∈ *sub* ∧ (*c*, *c*′) ∉ *sub*)}

normalized-sort *sub S* = (normalize-sort *sub S* = *S*)

We work with normalized sorts because it simplifies the derivation of efficient executable code later on.
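A direct Python transcription of the definition (illustrative naming, with *sub* again a set of subclass pairs):

```python
def normalize_sort(sub, S):
    """Drop every class that is strictly subsumed by another class in S."""
    return {c for c in S
            if not any((cp, c) in sub and (c, cp) not in sub for cp in S)}

def normalized_sort(sub, S):
    return normalize_sort(sub, S) == S

# Invented hierarchy: linorder is a subclass of order, so 'order' is redundant.
sub = {('linorder', 'order')}
assert normalize_sort(sub, {'linorder', 'order'}) == {'linorder'}
assert normalized_sort(sub, {'linorder'})
```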

Now we can define wellformedness of an *osig*:

wf-osig (*sub*, *tcs*) = (wf-subclass *sub* ∧ wf-tcsigs *sub tcs*)

A subclass relation is wellformed if it is a partial order where reflexivity is restricted to its Field. Wellformedness of type constructor signatures (wf-tcsigs) is more complex. We describe it in terms of *TCS* derived from *tcs* (see above). The conditions are the following:

	- ∀ (κ, *Ss*1, *c*1)∈*TCS*. ∀ *c*2. (*c*1, *c*2) ∈ *sub* −→

These conditions are used in a number of places to show that the type system is well behaved. For example, has-sort is upward closed:

wf-osig (*sub*, *tcs*) ∧ has-sort (*sub*, *tcs*) *T S* ∧ *S* ≤<sub>*sub*</sub> *S*′ −→ has-sort (*sub*, *tcs*) *T S*′

# **6 Signatures**

A *signature* consists of a map from constant names to their (most general) types, a map from type constructor names to their arities, and an order-sorted signature:

**type synonym** *signature* = (*name* ⇀ *typ*) × (*name* ⇀ *nat*) × *osig*

The three projection functions are called const-type, type-arity and osig. We now define a number of wellformedness checks w.r.t. a signature Σ. We start with wellformedness of types, which is pretty obvious:

$$\frac{\mathsf{type\text{-}arity}\;\Sigma\;\kappa = \mathsf{Some}\;|Ts| \qquad \forall\, T \in \mathsf{set}\;Ts.\; \mathsf{wf\text{-}type}\;\Sigma\;T}{\mathsf{wf\text{-}type}\;\Sigma\;(\mathsf{Ty}\;\kappa\;Ts)} \qquad \frac{\mathsf{wf\text{-}sort}\;(\mathsf{subclass}\;(\mathsf{osig}\;\Sigma))\;S}{\mathsf{wf\text{-}type}\;\Sigma\;(\mathsf{Tv}\;a\;S)}$$

Wellformedness of a term essentially just says that all types in the term are wellformed and that the type *T*′ of a constant occurrence in the term must be an instance of the type *T* of that constant in the signature: *T*′ ≲ *T*.

$$\frac{\mathsf{wf\text{-}type}\;\Sigma\;T}{\mathsf{wf\text{-}term}\;\Sigma\;(\mathsf{Fv}\;v\;T)} \qquad \frac{}{\mathsf{wf\text{-}term}\;\Sigma\;(\mathsf{Bv}\;n)} \qquad \frac{\mathsf{const\text{-}type}\;\Sigma\;c = \mathsf{Some}\;T \qquad \mathsf{wf\text{-}type}\;\Sigma\;T' \qquad T' \lesssim T}{\mathsf{wf\text{-}term}\;\Sigma\;(\mathsf{Ct}\;c\;T')}$$

$$\frac{\mathsf{wf\text{-}term}\;\Sigma\;t \qquad \mathsf{wf\text{-}term}\;\Sigma\;u}{\mathsf{wf\text{-}term}\;\Sigma\;(t \cdot u)} \qquad \frac{\mathsf{wf\text{-}type}\;\Sigma\;T \qquad \mathsf{wf\text{-}term}\;\Sigma\;t}{\mathsf{wf\text{-}term}\;\Sigma\;(\mathsf{Abs}\;T\;t)}$$

These rules only check whether a term conforms to a signature, not that the contained types are consistent. Combining wellformedness and ⊢<sub>τ</sub> yields welltypedness of a term:

wt-term Σ *t* = (wf-term Σ *t* ∧ (∃ *T*. ⊢<sub>τ</sub> *t* : *T*))

Wellformedness of a signature Σ = (*ctf* , *arf* , *oss*) where *oss* = (*sub*, *tcs*) is defined as follows:

```
wf-sig Σ =
((∀ T∈ran ctf . wf-type Σ T) ∧ wf-osig oss ∧ dom tcs = dom arf ∧
(∀ κ dm. tcs κ = Some dm −→ (∀ Ss∈ran dm. arf κ = Some |Ss|)))
```
In words: all types in *ctf* are wellformed, *oss* is wellformed, the type constructors in *tcs* are exactly those that have an arity in *arf*, and for every type constructor signature (κ, *Ss*, \_) in *tcs*, κ has arity |*Ss*|.

# **7 Logic**

Isabelle's metalogic M is an extension of the logic described by Paulson [30]. It is a fragment of intuitionistic higher-order logic. The basic types and connectives of M are the following:


The type subscripts of ⋀ and ≡ are dropped in the text if they can be inferred.

Readers familiar with Isabelle syntax must keep in mind that for readability we use the symbols ⋀, =⇒ and ≡ for the *encodings* of the respective symbols in Isabelle's metalogic. We avoid the corresponding metalogical constants completely in favour of HOL's ∀, −→, = and inference rule notation.

The provability judgment of M is of the form Θ,Γ ⊢ *t* where Θ is a theory, Γ (the hypotheses) is a set of terms of type prop, and *t* is a term of type prop.

A *theory* is a pair of a signature and a set of axioms:

**type synonym** *theory* = *signature* × *term set*

The projection functions are sig and axioms. We extend the notion of wellformedness from signatures to theories:

wf-theory (Σ, *axs*) = (wf-sig Σ ∧ (∀ *p*∈*axs*. wt-term Σ *p* ∧ ⊢<sub>τ</sub> *p* : prop) ∧ is-std-sig Σ ∧ eq-axs ⊆ *axs*)

The first two conjuncts need no explanation. Predicate is-std-sig (not shown) requires the signature to have certain minimal content: the basic types (→, prop) and constants (≡, ⋀, =⇒) of M and the additional types and constants for type class reasoning from Section 7.3. Our theories also need to contain a minimal set of axioms. The set eq-axs is an axiomatic basis for equality reasoning and will be explained in Section 7.2.

We will now discuss the inference system in three steps: the basic inference rules, equality and type class reasoning.

#### **7.1 Basic Inference Rules**

The *axiom rule* states that wellformed type-instances of axioms are provable:

$$\frac{\mathsf{wf\text{-}theory}\;\Theta \qquad t \in \mathsf{axioms}\;\Theta \qquad \mathsf{wf\text{-}inst}\;\Theta\;\varrho}{\Theta,\Gamma \vdash \varrho\;\$\$\;t}$$

where ϱ :: *var* ⇒ *sort* ⇒ *typ* is a type substitution and ϱ \$\$ *t* denotes its application (see Section 4). The types substituted into the type variables need to be wellformed and conform to the sort constraints of the type variables:

$$\begin{array}{l} \mathsf{wf\text{-}inst}\;(\Sigma,\,axs)\;\varrho = \\ \quad (\forall\, v\;S.\; \varrho\;v\;S \neq \mathsf{Tv}\;v\;S \longrightarrow \mathsf{has\text{-}sort}\;(\mathsf{osig}\;\Sigma)\;(\varrho\;v\;S)\;S \,\wedge\, \mathsf{wf\text{-}type}\;\Sigma\;(\varrho\;v\;S)) \end{array}$$

The conjunction only needs to hold if ϱ actually changes something, i.e. if ϱ *v S* ≠ Tv *v S*. This condition is not superfluous because otherwise has-sort *oss* (Tv *v S*) *S* and wf-type Σ (Tv *v S*) would only hold if *S* were wellformed w.r.t. Σ.

Note that there are no extra rules for general instantiation of type or term variables. Type variables can only be instantiated in the axioms. Term instantiation can be performed using the forall introduction and elimination rules.

The *assumption rule* allows us to prove terms already in the hypotheses:

$$\frac{\mathsf{wf\text{-}term}\;(\mathsf{sig}\;\Theta)\;t \qquad \vdash_{\tau} t : \mathsf{prop} \qquad t \in \Gamma}{\Theta,\Gamma \vdash t}$$

Both ⋀ and =⇒ are characterized by introduction and elimination rules:

$$\frac{\mathsf{wf\text{-}theory}\;\Theta \qquad \Theta,\Gamma \vdash t \qquad (x, T) \notin \mathrm{FV}\;\Gamma \qquad \mathsf{wf\text{-}type}\;(\mathsf{sig}\;\Theta)\;T}{\Theta,\Gamma \vdash \bigwedge\nolimits_{T} (\mathsf{Abs\text{-}fv}\;x\;T\;t)} \qquad \frac{\Theta,\Gamma \vdash \bigwedge\nolimits_{T} (\mathsf{Abs}\;T\;t) \qquad \vdash_{\tau} u : T \qquad \mathsf{wf\text{-}term}\;(\mathsf{sig}\;\Theta)\;u}{\Theta,\Gamma \vdash \mathsf{subst\text{-}bv}\;u\;t}$$

$$\frac{\mathsf{wf\text{-}theory}\;\Theta \qquad \Theta,\Gamma \vdash u \qquad \mathsf{wf\text{-}term}\;(\mathsf{sig}\;\Theta)\;t \qquad \vdash_{\tau} t : \mathsf{prop}}{\Theta,\Gamma - \{t\} \vdash t \Longrightarrow u} \qquad \frac{\Theta,\Gamma_1 \vdash t \Longrightarrow u \qquad \Theta,\Gamma_2 \vdash t}{\Theta,\Gamma_1 \cup \Gamma_2 \vdash u}$$

where FV Γ = (⋃<sub>*t*∈Γ</sub> fv *t*).

#### **7.2 Equality**

Most rules about equality are not part of the inference system but are axioms (the set eq-axs mentioned above). Consequences are obtained via the axiom rule.

The first three axioms express that ≡ is reflexive, symmetric and transitive:

$$x \equiv x \qquad x \equiv y \implies y \equiv x \qquad x \equiv y \implies y \equiv z \implies x \equiv z$$

The next two axioms express that terms of type prop (*A* and *B*) are equal iff they are logically equivalent:

$$A \equiv B \implies A \implies B \qquad (A \implies B) \implies (B \implies A) \implies A \equiv B$$

The last equality axioms are congruence rules for application and abstraction:

$$f \equiv g \implies x \equiv y \Longrightarrow (f \cdot x) \equiv (g \cdot y)$$

$$\bigwedge (\mathsf{Abs}\;T\;((f \cdot \mathsf{Bv}\;0) \equiv (g \cdot \mathsf{Bv}\;0))) \Longrightarrow \mathsf{Abs}\;T\;(f \cdot \mathsf{Bv}\;0) \equiv \mathsf{Abs}\;T\;(g \cdot \mathsf{Bv}\;0)$$

Paulson [30] gives a slightly different congruence rule for abstraction, which allows abstracting over an arbitrary free *x* in *f* and *g*. We are able to derive this rule in our inference system.

Finally there are the lambda calculus rules. There is no need for α conversion because α-equivalent terms are already identical thanks to the De Bruijn indices for bound variables. For β and η conversion the following rules are added. In contrast to the rest of this subsection, these are not expressed as axioms.

$$\frac{\mathsf{wf\text{-}theory}\;\Theta \qquad \mathsf{wt\text{-}term}\;(\mathsf{sig}\;\Theta)\;(\mathsf{Abs}\;T\;t) \qquad \mathsf{wf\text{-}term}\;(\mathsf{sig}\;\Theta)\;u \qquad \vdash_{\tau} u : T}{\Theta,\Gamma \vdash (\mathsf{Abs}\;T\;t \cdot u) \equiv \mathsf{subst\text{-}bv}\;u\;t}\;(\beta)$$

$$\frac{\mathsf{wf\text{-}theory}\;\Theta \qquad \mathsf{wf\text{-}term}\;(\mathsf{sig}\;\Theta)\;t \qquad \vdash_{\tau} t : T \to T'}{\Theta,\Gamma \vdash \mathsf{Abs}\;T\;(t \cdot \mathsf{Bv}\;0) \equiv t}\;(\eta)$$

Rule (β) uses the substitution function subst-bv as explained in Section 4 (and defined in the Appendix).

Rule (η) requires a few words of explanation. We do not explicitly require that *t* does not contain Bv *0*. This is already a consequence of the precondition ⊢<sub>τ</sub> *t* : *T* → *T*′: it implies that *t* is closed. For that reason it is perfectly unproblematic to remove the abstraction above *t*.

#### **7.3 Type Class Reasoning**

Wenzel [38] encoded class constraints of the form "type *T* has class *c*" in the term language as follows. There is a unary type constructor named *"itself"* and *T* itself abbreviates Ty *"itself"* [*T*]. The notation *TYPE*<sub>*T* itself</sub> is short for Ct *"type"* (*T* itself) where *"type"* is the name of a new uninterpreted constant. You should view *TYPE*<sub>*T* itself</sub> as the term-level representation of type *T*.

Next we represent the predicate "is of class *c*" on the term level. For this we define some fixed injective mapping const-of-class from class names to constant names. For each new class *c* a new constant const-of-class *c* of type *T* itself → prop is added. The term Ct (const-of-class *c*) (*T* itself → prop) · *TYPE*<sub>*T* itself</sub> represents the statement "type *T* has class *c*". This is the inference rule deriving such propositions:

$$\frac{\begin{array}{c}\mathsf{wf\text{-}theory}\;\Theta \qquad \mathsf{const\text{-}type}\;(\mathsf{sig}\;\Theta)\;(\mathsf{const\text{-}of\text{-}class}\;C) = \mathsf{Some}\;(a\;\mathsf{itself} \to \mathsf{prop}) \\ \mathsf{wf\text{-}type}\;(\mathsf{sig}\;\Theta)\;T \qquad \mathsf{has\text{-}sort}\;(\mathsf{osig}\;(\mathsf{sig}\;\Theta))\;T\;\{C\}\end{array}}{\Theta,\Gamma \vdash \mathsf{Ct}\;(\mathsf{const\text{-}of\text{-}class}\;C)\;(T\;\mathsf{itself} \to \mathsf{prop}) \cdot \mathit{TYPE}_{T\;\mathsf{itself}}}$$

This is how the has-sort inference system is integrated into the logic.

This concludes the presentation of M. We have shown some minimal sanity properties, incl. that all provable terms are of type prop and wellformed:

**Theorem 1.** Θ,Γ ⊢ *t* −→ ⊢<sub>τ</sub> *t* : prop ∧ wf-term (sig Θ) *t*

The attentive reader will have noticed that we do not require unused hypotheses in Γ to be wellformed and of type prop. Similarly, we only require wf-theory Θ in rules that need it to preserve wellformedness of the terms and types involved. To restrict to wellformed theories and hypotheses we define a top-level provability judgment that requires wellformedness:

Θ,Γ *<sup>t</sup>* = (wf-theory <sup>Θ</sup> <sup>∧</sup> (<sup>∀</sup> *<sup>h</sup>*∈Γ . wf-term (sig <sup>Θ</sup>) *<sup>h</sup>* ∧ <sup>τ</sup> *<sup>h</sup>* : prop) <sup>∧</sup> Θ,Γ *<sup>t</sup>*)

# **8 Proof Terms and Checker**

Berghofer and Nipkow [4] added proof terms to Isabelle. We present an executable checker for these proof terms that is proved sound w.r.t. the above formalization of the metalogic. Berghofer and Nipkow also developed a proof checker but it was unverified and checked the generated proof terms by feeding them back through Isabelle's unverified inference kernel.

It is crucial to realize that all we need to know about the proof term checker is the soundness theorem below. The internals are, from a soundness perspective, irrelevant, which is why we can get away with sketching them informally. This is in contrast to the logic itself, which acts like a specification, which is why we presented it in detail.

This is our data type of proof terms:

```
datatype proofterm = PAxm term (((var × sort) × typ) list) | PBound nat
 | Abst typ proofterm | AbsP term proofterm | Appt proofterm term
 | AppP proofterm proofterm | OfClass typ name | Hyp term
```
These proof terms are not designed to record proofs in our inference system, but to mirror the proof terms generated by Isabelle. Nevertheless, the constructors of our proof terms correspond roughly to the rules of the inference system. PAxm contains an axiom and a type substitution. This substitution is encoded as an association list instead of a function. AbsP and Abst correspond to the introduction of =⇒ and ⋀, and AppP and Appt correspond to the respective eliminations. Hyp and PBound relate to the assumption rule: Hyp refers to a free assumption, while PBound contains a De Bruijn index referring to an assumption added during the proof by an AbsP constructor. OfClass denotes a proof that a type belongs to a given type class.

Isabelle treats terms modulo αβη-equivalence and therefore does not record β or η steps, while they are explicit steps in our inference system. Hence we have no constructors corresponding to the (β) and (η) rules. The remaining equality axioms are naturally handled by the PAxm constructor.

In the rest of the section we discuss how to derive an executable proof checker. Executability means that the checker is defined as a set of recursive functions that Isabelle's code generator can translate into one of a number of target languages, in particular its implementation language SML [5,9,8].

Because of the approximate correspondence between proof term constructors and inference rules, implementing the proof checker largely amounts to providing executable versions of each inference rule, as in LCF: each rule becomes a function that checks the side conditions and, if they are true, computes the conclusion from the premises given as arguments. The overall checker is a function

replay :: *theory* ⇒ *proofterm* ⇒ *term option*

In particular we need to make the inductive wellformedness checks for sorts, types and terms, signatures and theories executable. Mostly, this amounts to providing recursive versions of inductive definitions and proving them equivalent.
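To convey the LCF-style structure of replay without the full machinery (types, sorts, type instantiation, βη-normalization), here is a deliberately tiny Python model covering only the implication fragment. The constructor names mirror the proofterm datatype; everything else (the tuple encodings, `('Imp', ..)`, `('Atom', ..)`) is our invention:

```python
def replay(axioms, hs, prf):
    """Return the proposition proved by prf, or None if checking fails.
    hs is the stack of assumptions opened by enclosing AbsP nodes."""
    tag = prf[0]
    if tag == 'PAxm':                 # axiom lookup (no type instantiation in this toy)
        return prf[1] if prf[1] in axioms else None
    if tag == 'Hyp':                  # free hypothesis, collected as a hyp of the proof
        return prf[1]
    if tag == 'PBound':               # De Bruijn reference into the assumption stack
        return hs[prf[1]] if prf[1] < len(hs) else None
    if tag == 'AbsP':                 # ==>-introduction
        _, t, body = prf
        u = replay(axioms, [t] + hs, body)
        return ('Imp', t, u) if u is not None else None
    if tag == 'AppP':                 # ==>-elimination (modus ponens)
        _, p, q = prf
        imp, arg = replay(axioms, hs, p), replay(axioms, hs, q)
        if imp is not None and imp[0] == 'Imp' and imp[1] == arg:
            return imp[2]
    return None

def check_proof(axioms, prf, p):
    # the real check-proof additionally checks wf-theory
    return replay(axioms, [], prf) == p

# AbsP/PBound prove A ==> A from no axioms at all.
A = ('Atom', 'A')
assert check_proof(set(), ('AbsP', A, ('PBound', 0)), ('Imp', A, A))
```

Each branch is one rule turned into a function: check the side conditions, then compute the conclusion from the recursively replayed premises, with `None` propagating any failure.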

We now discuss some of the more difficult implementation steps. To model Isabelle's view of terms modulo αβη-equivalence, we βη-normalize our terms during the reconstruction of the proof (α-equivalence comes for free thanks to De Bruijn notation). A lengthy proof, whose details we do not go into, shows that this preserves provability:

wf-theory Θ ∧ finite Γ ∧ (∀ *A*∈Γ. wt-term (sig Θ) *A* ∧ ⊢<sub>τ</sub> *A* : prop) ∧ Θ,Γ ⊢ *t* ∧ beta-eta-norm *t* = Some *u* −→ Θ,Γ ⊢ *u*
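For concreteness, the β part of normalization can be sketched as follows (Python, with the same illustrative tuple-encoded terms as before; the formalization's beta-eta-norm additionally performs η-contraction, which we omit here). Termination relies on the input being simply typed:

```python
def lift(t, n):
    # shift De Bruijn indices >= n up by one (substituend crossing a binder)
    if t[0] == 'Bv':
        return ('Bv', t[1] + 1) if t[1] >= n else t
    if t[0] == 'Abs':
        return ('Abs', t[1], lift(t[2], n + 1))
    if t[0] == 'App':
        return ('App', lift(t[1], n), lift(t[2], n))
    return t

def subst_bv2(t, n, u):
    # replace Bv n by u, decrementing indices above n
    if t[0] == 'Bv':
        return t if t[1] < n else (u if t[1] == n else ('Bv', t[1] - 1))
    if t[0] == 'Abs':
        return ('Abs', t[1], subst_bv2(t[2], n + 1, lift(u, 0)))
    if t[0] == 'App':
        return ('App', subst_bv2(t[1], n, u), subst_bv2(t[2], n, u))
    return t

def beta_norm(t):
    """Contract all beta-redexes, normalizing subterms first."""
    if t[0] == 'App':
        f, u = beta_norm(t[1]), beta_norm(t[2])
        if f[0] == 'Abs':                 # (Abs T b) . u  ->  subst-bv u b
            return beta_norm(subst_bv2(f[2], 0, u))
        return ('App', f, u)
    if t[0] == 'Abs':
        return ('Abs', t[1], beta_norm(t[2]))
    return t

prop = ('Ty', 'prop', [])
x = ('Fv', 'x', prop)
assert beta_norm(('App', ('Abs', prop, ('Bv', 0)), x)) == x
```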

Isabelle's code generator needs some help handling the maps used in the (order-sorted) signatures. We provide a refinement of maps to association lists. Another problematic point is the definition of the type instance relation (≲), which contains an (unbounded) existential quantifier. To make this executable, we provide an implementation that tries to compute a suitable type substitution. In a further step, we refine the type substitution to an association list as well.
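The executable version of ≲ can be sketched by first-order matching: attempt to build the substitution witnessing the existential. Python, illustrative encoding as before (sort sets are frozen to serve as dict keys); the actual implementation additionally refines the substitution to an association list:

```python
def type_match(pattern, T, inst):
    """Extend inst (mapping (var, sort) pairs to types) so that applying it
    to pattern yields T; return False if no such extension exists."""
    if pattern[0] == 'Tv':
        key = (pattern[1], frozenset(pattern[2]))
        if key in inst:
            return inst[key] == T          # variable already bound: must agree
        inst[key] = T
        return True
    if T[0] != 'Ty' or pattern[1] != T[1] or len(pattern[2]) != len(T[2]):
        return False                       # constructor or arity mismatch
    return all(type_match(p, arg, inst) for p, arg in zip(pattern[2], T[2]))

def type_instance(T1, T2):
    # T1 <~ T2  =  (exists rho. rho $$ T2 = T1)
    return type_match(T2, T1, {})

a = ('Tv', 'a', {'order'})
nat = ('Ty', 'nat', [])
bool_ = ('Ty', 'bool', [])
assert type_instance(('Ty', 'fun', [nat, nat]), ('Ty', 'fun', [a, a]))
assert not type_instance(('Ty', 'fun', [nat, bool_]), ('Ty', 'fun', [a, a]))
```

The second assertion fails because the shared variable *a* would have to map to both *nat* and *bool*; the dict records the first binding and rejects the conflicting one.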

In the end we obtain a proof checker

check-proof Θ *P p* = (wf-theory Θ ∧ replay Θ *P* = Some *p*)

that checks theory Θ and checks whether proof *P* proves the given proposition *p*. The latter check is important because the Isabelle theorems that we check contain both a proof and a proposition that the theorem claims to prove. Function check-proof checks this claim. As one of our main results, we can prove the correctness of our checker:

**Theorem 2.** check-proof Θ *P p* −→ Θ, set (hyps *P*) ⊢ *p*

The proof itself is conceptually simple and proceeds by induction over the structure of proof terms. For each proof constructor we need to show that the corresponding inference rule leads to the same conclusion as its functional version used by replay. Most of the proof effort goes into a large library of results about terms, types, signatures, substitutions, wellformedness etc. required for the proof, most importantly the fact that βη-normalization preserves provability.

# **9 Size and Structure of the Formalization**

All material presented so far has been formalized in Isabelle/HOL. The definition of the inference system (incl. types, terms etc.) resides in a separate theory *Core* that depends only on the basic library of Isabelle/HOL. It takes about 300 LOC and is fairly high-level and readable – we presented most of it. This is at least an order of magnitude smaller than Isabelle's inference kernel (which is not clearly delineated) – of course the latter is optimized for performance. Its abstract type of theorems alone takes about 2,500 LOC, not counting any infrastructure of terms, types, unification etc.

The whole formalization consists of 10,000 LOC. The main components are:


# **10 Integration with Isabelle**

As explained above, Isabelle generates SML code for the proof checker. This code has its own definitions of types, terms etc. and needs to be interfaced with the corresponding data structures in Isabelle. This step requires 150 lines of handwritten SML code (*glue code*) that translates Isabelle's data structures into the corresponding data structures in the generated proof checker such that we can feed them into check-proof. We cannot verify this code and therefore aim to keep it as small and simple as possible. This is the reason for the previously mentioned *intentional implementation bias* we introduced in our formalization. We describe now how the various data types are translated. We call a translation trivial if it merely replaces one constructor by another, possibly forgetting some information.

The translation of types and terms is trivial as their structure is almost identical in the two settings. For Isabelle code experts it should be mentioned that the two term constructors Free and Var in Isabelle (which both represent free variables, but Var can be instantiated by unification) are combined in the type *var* of the formalization, which we left unspecified but which in fact looks like this: **datatype** *var* = *Free name* | *Var indexname*. This is purely to trivialize the glue code; in our formalization *var* is totally opaque.

Proof term translation is trivial except for two special cases. Previously proved lemmas become axioms in the translation (see also below) and so-called "oracles" (typically the result of unfinished proofs, i.e. "sorry" on the user level) are rejected (but none of the theories we checked contain oracles). Also remember that the translation of proofs is not safety critical because all that matters is that in the end we obtain a correct proof of the claimed proposition.

We also provide functions to translate relevant content from the background theory: axioms and (order-sorted) signatures. This mostly amounts to extracting association lists from efficient internal data structures. Translating the axioms also involves translating some alternative internal representation of type class constraints into their standard form presented in Sect. 7.3.

The checker is integrated into Isabelle by calling it every time a new named theorem has been proved. The set of theorems proved so far is added to the axiomatic basis for this check. Cyclic dependencies between lemmas are ruled out by this ordering because every theorem is checked before being added to the axiomatic basis. However, an explicit cyclicity check is not part of the formalization (yet), which speaks only about checking single proofs.

#### **11 Running the Proof Checker**

We ran this modified Isabelle with our proof checker on multiple theories in various object logics contained in the Isabelle distribution. A rough overview of the scope of the covered material for some logics and the required running times can be found in the following table. The running times are the total times for running Isabelle, not just the proof checking, but the latter takes 90% of the time. All tests were performed on an Intel Core i7-9750H CPU running at 2.60 GHz with 32 GB of RAM.


We can check the material in several smaller object logics in their entirety. One of the larger such logics is first-order logic (FOL). These logics do not develop any applications but FOL comes with proof automation and theories testing that automation, in particular Pelletier's collection of problems that were considered challenges in their day [32]. Because the proofs are found automatically, the resulting proof terms will typically be quite complex and good test material for a proof checker.

The logic ZF (Zermelo-Fraenkel set theory) builds on FOL but contains real applications and is an order of magnitude larger than FOL. We are able to check all material formalized in ZF in the Isabelle distribution.

Isabelle's most frequently used and largest object logic is HOL. We managed to check about 12% of the Main library. This includes the basic logic and the libraries of sets, functions, orderings, lattices and groups. The formalizations are non-trivial and make heavy use of Isabelle's type classes.

Why can we check about five times as many lines of code in ZF compared to HOL? Profiling revealed that the proof checker spends a lot of time in functions that access the signature, especially the wellformedness checks. The primary reasons: inefficient data structures (e.g. association lists), which make the running time depend heavily on the size of the signature, growing with every new constant, type and class. To make matters worse, there is no sharing of any kind in terms/types and their wellformedness checks. Because ZF is free of polymorphism and type classes, these wellformedness checks are much simpler.

## **12 Trust Assumptions**

We need to trust the following components outside of the formalization:


Because users currently cannot examine Isabelle's internal data structures that we start from, they have to trust Isabelle's front end that parses and transforms some textual input file into internal data structures. One could add a (possibly verified) presentation layer that outputs those internal representations into a readable format that can be inspected, while avoiding the traps Adams [3] is concerned with.

# **13 Future Work**

Our primary focus will be on scaling up the proof checker to not just deal with all of HOL but with real applications (including itself!). There is a host of avenues for exploration. Just to name a few promising directions: more efficient data structures than association lists (e.g. via existing frameworks [19,20]); caching of wellformedness checks for types and terms; exploiting sharing within terms and types (tricky because our intentionally simple glue code creates copies); working with the compressed proof terms [5] that Isabelle creates by default instead of uncompressing them as we do now.

We will also upgrade the formalization of our checker from individual theorems to sets of theorems, explicitly checking for cyclic dependencies (which are currently prevented by the glue code, see Sect. 10).

A presentation layer as discussed in Sect. 12 would not just allow the inspection of the internal representation of the theories but could also be extended to the proofs themselves, thus permitting checkers to be interfaced with Isabelle on a textual level instead of internal data structures.

It would also be nice to have a model-theoretic semantics for M. We believe that the work by Kunčar and Popescu [15,16,17,18] could be adapted from HOL to M. This would in particular yield semantically justified cyclicity checks for constant and type definitions, which we currently treat as axioms because a purely syntactic justification is unclear.

#### **Acknowledgements**

We thank Kevin Kappelmann, Magnus Myreen, Larry Paulson, Andrei Popescu, Makarius Wenzel and the anonymous reviewers for their comments.

## **A Appendix**

```
subst-bv u t = subst-bv2 t 0 u

subst-bv2 (Bv i) n u = (if i < n then Bv i
                        else if i = n then u else Bv (i − 1))
subst-bv2 (Abs T t) n u = Abs T (subst-bv2 t (n + 1) (lift u 0))
subst-bv2 (f · t) n u = subst-bv2 f n u · subst-bv2 t n u
subst-bv2 t _ _ = t

lift (Bv i) n = (if n ≤ i then Bv (i + 1) else Bv i)
lift (Abs T t) n = Abs T (lift t (n + 1))
lift (f · t) n = lift f n · lift t n
lift t _ = t

bind-fv var t = bind-fv2 var 0 t

bind-fv2 var n (Fv v T) = (if var = (v, T) then Bv n else Fv v T)
bind-fv2 var n (Abs T t) = Abs T (bind-fv2 var (n + 1) t)
bind-fv2 var n (f · u) = bind-fv2 var n f · bind-fv2 var n u
bind-fv2 _ _ t = t
```
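For readers who prefer executable notation, the de Bruijn operations above can be transliterated into Python. This is a sketch under our own encoding of terms (the dataclasses `Bv`, `Fv`, `Abs`, `App` are hypothetical, not the paper's Isabelle datatype):

```python
from dataclasses import dataclass

# Hypothetical term datatype: bound variables (Bv), free variables (Fv),
# typed lambda abstraction (Abs) and application (App).
@dataclass(frozen=True)
class Bv:
    i: int

@dataclass(frozen=True)
class Fv:
    v: str
    T: str

@dataclass(frozen=True)
class Abs:
    T: str
    t: object

@dataclass(frozen=True)
class App:
    f: object
    t: object

def lift(t, n):
    # Shift bound indices >= n up by one (when moving under a binder).
    if isinstance(t, Bv):
        return Bv(t.i + 1) if n <= t.i else t
    if isinstance(t, Abs):
        return Abs(t.T, lift(t.t, n + 1))
    if isinstance(t, App):
        return App(lift(t.f, n), lift(t.t, n))
    return t

def subst_bv2(t, n, u):
    # Substitute u for bound index n in t, decrementing higher indices.
    if isinstance(t, Bv):
        return t if t.i < n else (u if t.i == n else Bv(t.i - 1))
    if isinstance(t, Abs):
        return Abs(t.T, subst_bv2(t.t, n + 1, lift(u, 0)))
    if isinstance(t, App):
        return App(subst_bv2(t.f, n, u), subst_bv2(t.t, n, u))
    return t

def subst_bv(u, t):
    # Substitution of u for the outermost bound variable of t.
    return subst_bv2(t, 0, u)

def bind_fv2(var, n, t):
    # Abstract the free variable var (a (name, type) pair) to index n.
    if isinstance(t, Fv) and (t.v, t.T) == var:
        return Bv(n)
    if isinstance(t, Abs):
        return Abs(t.T, bind_fv2(var, n + 1, t.t))
    if isinstance(t, App):
        return App(bind_fv2(var, n, t.f), bind_fv2(var, n, t.t))
    return t

def bind_fv(var, t):
    return bind_fv2(var, 0, t)
```

For instance, `subst_bv(Fv("x", "a"), App(Bv(0), Abs("a", Bv(1))))` replaces both occurrences of the outermost bound variable, yielding `App(Fv("x", "a"), Abs("a", Fv("x", "a")))`.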

#### **References**

1. Åman Pohjola, J., Gengelbach, A.: A mechanised semantics for HOL with ad-hoc overloading. In: Albert, E., Kovács, L. (eds.) LPAR 2020: 23rd International Conference on Logic for Programming, Artificial Intelligence and Reasoning. EPiC Series in Computing, vol. 73, pp. 498–515. EasyChair (2020). https://doi.org/10.29007/413d



**Theory and Principles**

# The ksmt Calculus Is a *δ*-complete Decision Procedure for Non-linear Constraints

Franz Brauße<sup>2</sup>, Konstantin Korovin<sup>2</sup>, Margarita V. Korovina<sup>3</sup>, and Norbert Th. Müller<sup>1</sup>

<sup>1</sup> Abteilung Informatikwissenschaften, Universität Trier, Trier, Germany
<sup>2</sup> The University of Manchester, Manchester, UK
<sup>3</sup> A.P. Ershov Institute of Informatics Systems, Novosibirsk, Russia
brausse@informatik.uni-trier.de, konstantin.korovin@manchester.ac.uk

Abstract. ksmt is a CDCL-style calculus for solving non-linear constraints over the real numbers involving polynomials and transcendental functions. In this paper we investigate properties of the ksmt calculus and show that it is a δ-complete decision procedure for bounded problems. We also propose an extension with local linearisations, which allow for more efficient treatment of non-linear constraints.

## 1 Introduction

Solving non-linear constraints is important in many applications, including verification of cyber-physical systems, software verification, and proof assistants for mathematics [25,21,2,1,15,6]. Hence there have been a number of approaches to solving non-linear constraints, involving symbolic methods [16,23,29,18] as well as numerically inspired ones, in particular for dealing with transcendental functions [13,30], and combinations of symbolic and numeric methods [7,11,12].

In [7] we introduced the ksmt calculus for solving non-linear constraints over a large class of functions including polynomial, exponential and trigonometric functions. The ksmt calculus<sup>4</sup> combines CDCL-style reasoning [28,22,3] over the reals, based on conflict resolution [19], with incremental linearisations of non-linear functions using methods from computable analysis [31,24]. Our approach is based on computable analysis and exact real arithmetic, which avoid the limitations of double-precision computations caused by rounding errors and instabilities in numerical methods. In particular, the satisfiable and unsatisfiable results returned by ksmt are exact, as required in many applications. This approach also supports implicit representations of functions as solutions of ODEs and PDEs [26].

It is well known that in the presence of transcendental functions the constraint satisfiability problem is undecidable [27]. However, if we only require solutions up to some specified precision δ, then the problem can be solved algorithmically on bounded instances; this is the motivation behind δ-completeness,

⋆ This research was partially supported by an Intel research grant, the DFG grant WERA MU 1801/5-1 and the RFBR-JSPS 20-51-5000 grant.

<sup>4</sup> Implementation is available at http://informatik.uni-trier.de/~brausse/ksmt/

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 113–130, 2021. https://doi.org/10.1007/978-3-030-79876-5_7

which was introduced in [13]. In essence, a δ-complete procedure decides whether a formula is unsatisfiable or a δ-weakening of the formula is satisfiable.

In this paper we investigate theoretical properties of the ksmt calculus, and its extension δ-ksmt for the δ-SMT setting. Our main results are as follows:


In Section 3, we give an overview of the ksmt calculus and introduce the notion of ε-full linearisation used throughout the rest of the paper. We also present a completeness theorem. Section 4 introduces the notion of δ-completeness and related concepts. In Section 5 we introduce the δ-ksmt adaptation, prove that it is correct and δ-complete, and give concrete effective linearisations based on a uniform modulus of continuity. Finally, in Section 6, we introduce local linearisations and show that termination is independent of computing uniform moduli of continuity, before we conclude in Section 7.

## 2 Preliminaries

The following conventions are used throughout this paper. By $\|\cdot\|$ we denote the maximum norm $\|(x_1, x_2, \ldots, x_n)\| = \max\{|x_i| : 1 \le i \le n\}$. When it helps clarity, we write finite and infinite sequences $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_i)_i$ in bold typeface. We are going to use open balls $B(\mathbf{c}, \varepsilon) = \{\mathbf{x} : \|\mathbf{x} - \mathbf{c}\| < \varepsilon\} \subseteq \mathbb{R}^n$ for $\mathbf{c} \in \mathbb{R}^n$ and $\varepsilon > 0$, and $\bar{A}$ to denote the closure of the set $A \subseteq \mathbb{R}^n$ in the standard topology induced by the norm. By $\mathbb{Q}_{>0}$ we denote the set $\{q \in \mathbb{Q} : q > 0\}$. For sets $X, Y$, a (possibly partial) function from $X$ to $Y$ is written as $X \to Y$. We use the notion of compactness: a set $A$ is compact iff every open cover of $A$ has a finite subcover. In Euclidean spaces this is equivalent to $A$ being bounded and closed [32].

#### Basic Notions of Computable Analysis

Let us recall the notion of computability of functions over real numbers used throughout this paper. A rational number $q$ is an $n$-*approximation* of a real number $x$ if $\|q - x\| \le 2^{-n}$. Informally, a function $f$ is *computed* by a function-oracle Turing machine $M^?_f$, where $?$ is a placeholder for the oracle representing the argument of the function, in the following way. The real argument $x$ is represented by an oracle function $\varphi : \mathbb{N} \to \mathbb{Q}$, for each $n$ returning an $n$-approximation $\varphi_n$ of $x$. For simplicity, we refer to $\varphi$ by the sequence $(\varphi_n)_n$. When run with argument $p \in \mathbb{N}$, $M^\varphi_f(p)$ computes a rational $p$-approximation of $f(x)$ by querying its oracle $\varphi$ for approximations of $x$. Let us note that the definition of the oracle machine does not depend on the concrete oracle, i.e., the oracle can be seen as a parameter. In case only the machine without a concrete oracle is of interest, we write $M^?_f$. We refer to [17] for a precise definition of the model of computation by function-oracle Turing machines, which is standard in computable analysis.

**Definition 1 ([17]).** *Consider $\mathbf{x} \in \mathbb{R}^n$. A* name *for $\mathbf{x}$ is a rational sequence $\varphi = (\varphi_k)_k$ such that $\forall k : \|\varphi_k - \mathbf{x}\| \le 2^{-k}$. A function $f : \mathbb{R}^n \to \mathbb{R}$ is* computable *iff there is a function-oracle Turing machine $M^?_f$ such that for all $\mathbf{x} \in \operatorname{dom} f$ and names $\varphi$ for $\mathbf{x}$, $|M^{\varphi}_f(p) - f(\mathbf{x})| \le 2^{-p}$ holds for all $p \in \mathbb{N}$.*
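Definition 1 can be illustrated by a toy function-oracle machine. The helper names `name_of` and `M_square` are our own; we compute $f(x) = x^2$ on a bounded domain, with the machine choosing how precisely to query its oracle:

```python
from fractions import Fraction

def name_of(num, den):
    # A name for the rational x = num/den: phi(k) must be a
    # k-approximation of x; for a rational input it can be exact.
    x = Fraction(num, den)
    return lambda k: x          # |phi(k) - x| = 0 <= 2**-k

def M_square(phi, p, bound=8):
    # Function-oracle machine for f(x) = x^2 on |x| <= bound: returns a
    # p-approximation of x^2 by querying the oracle precisely enough.
    # |q^2 - x^2| = |q - x| * |q + x| <= 2**-n * (2*bound + 1),
    # so pick n with 2**-n * (2*bound + 1) <= 2**-p.
    n = p
    while Fraction(2 * bound + 1, 2**n) > Fraction(1, 2**p):
        n += 1
    q = phi(n)                  # an n-approximation of the argument
    return q * q

phi = name_of(3, 2)             # a name for x = 3/2
y = M_square(phi, 10)           # a 10-approximation of (3/2)^2 = 9/4
assert abs(y - Fraction(9, 4)) <= Fraction(1, 2**10)
```

Irrational arguments would require genuinely converging names, e.g. truncated series expansions; the structure of the machine is unchanged.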

This definition is closely related to interval arithmetic with unrestricted precision, but enhanced with the guarantee of convergence, and it is equivalent to the notion of computability used in [31]. The class of computable functions contains polynomials and transcendental functions like sin, cos, exp, among others. It is well known [17,31] that this class is closed under composition and that computable functions are continuous. By continuity, a computable function $f : \mathbb{R}^n \to \mathbb{R}$ total on a compact $D \subset \mathbb{R}^n$ has a computable *uniform modulus of continuity* $\mu_f : \mathbb{N} \to \mathbb{N}$ on $D$ [31, Theorem 6.2.7], that is,

$$\forall k \in \mathbb{N}\ \forall y, z \in D: \|y - z\| \le 2^{-\mu_f(k)} \implies |f(y) - f(z)| \le 2^{-k}.\tag{2.1}$$

A uniform modulus of continuity of f expresses how changes in the value of f depend on changes of the arguments in a uniform way.
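For a concrete instance of Eq. (2.1), take $f = \sin$: since sin is 1-Lipschitz, $\mu(k) = k$ is a uniform modulus of continuity on any domain. The following spot-check is our illustration, sampling pairs within the prescribed distance:

```python
import math
import random

# sin is 1-Lipschitz, so mu(k) = k satisfies Eq. (2.1):
# ||y - z|| <= 2**-mu(k)  implies  |sin(y) - sin(z)| <= 2**-k.
def mu(k):
    return k

random.seed(0)
for k in range(1, 20):
    for _ in range(100):
        y = random.uniform(-10.0, 10.0)
        # pick z strictly inside the ball B(y, 2**-mu(k))
        z = y + 0.5 * random.uniform(-1.0, 1.0) * 2.0 ** -mu(k)
        assert abs(math.sin(y) - math.sin(z)) <= 2.0 ** -k
```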

## 3 The ksmt Calculus

We first describe the ksmt calculus for solving non-linear constraints [7] informally, and subsequently recall the main definitions which we use in this paper. The ksmt calculus consists of transition rules, which, for any formula in separated linear form, allow deriving lemmas consistent with the formula and, in case of termination, produce a satisfying assignment for the formula or show that it is unsatisfiable. A quantifier-free formula is in separated linear form $L \cup N$ if $L$ is a set of clauses over linear constraints and $N$ is a set of non-linear atomic constraints; this notion is rigorously defined below.

In the ksmt calculus there are four transition rules applied to its states: Assignment refinement (A), Conflict resolution (R), Backjumping (B) and Linearisation (L). The final ksmt states are sat and unsat. A non-final ksmt state is a triple $(\alpha, L, N)$ where $\alpha$ is a (partial) assignment of variables to rationals. A ksmt derivation starts with an initial state where $\alpha$ is empty and tries to extend this assignment to a solution of $L \cup N$ by repeatedly applying the assignment refinement rule. When such an assignment extension is not possible, we either obtain a linear conflict, which is resolved using the conflict resolution rule, or a non-linear conflict, which is resolved using the linearisation rule.

The main idea behind the linearisation rule is to approximate the non-linear constraints around the conflict using linear constraints in such a way that the

Fig. 1. Core of ksmt calculus. Derivations terminate in red nodes.

conflict will be shifted into the linear part, where it will be resolved using conflict resolution. Application of either of these two rules results in a state containing a clause evaluating to false under the current assignment. This is followed either by application of the backjumping rule, which undoes assignments, or by termination in case the formula is unsatisfiable. In this procedure, only the assignment and the linear part of the state change; the non-linear part stays fixed.

*Notations.* Let $F_{\mathrm{lin}}$ consist of rational constants, addition and multiplication by rational constants; $F_{\mathrm{nl}}$ denotes an arbitrary collection of non-linear computable functions including transcendental functions and polynomials over the reals. We consider the structure $(\mathbb{R}, F_{\mathrm{lin}} \cup F_{\mathrm{nl}}, P)$ where $P = \{<, \le, >, \ge, =, \neq\}$ and a set of variables $V = \{x_1, x_2, \ldots, x_n, \ldots\}$. We will use, possibly with indices, $x$ to denote variables and $q, c, e$ for rational constants. Terms, predicates and formulas over $V$ are defined in the standard way. An *atomic linear constraint* is a formula of the form $q + c_1 x_1 + \ldots + c_n x_n \diamond 0$ where $q, c_1, \ldots, c_n \in \mathbb{Q}$ and $\diamond \in P$. Negations of atomic formulas can be eliminated by rewriting the predicate symbol in the standard way, hence we assume that all literals are positive. A *linear constraint* is a disjunction of atomic linear constraints, also called a *(linear) clause*. An *atomic non-linear constraint* is a formula of the form $f(\mathbf{x}) \diamond 0$, where $\diamond \in P$ and $f$ is a composition of computable non-linear functions from $F_{\mathrm{nl}}$ over variables $\mathbf{x}$. Throughout this paper, for every computable real function $f$ we use $M^?_f$ to denote a function-oracle Turing machine computing $f$. We assume quantifier-free formulas in *separated linear form* [7, Definition 1], that is, $L \cup N$ where $L$ is a set of linear constraints and $N$ is a set of non-linear atomic constraints. Arbitrary quantifier-free formulas can be transformed equi-satisfiably into separated linear form in polynomial time [7, Lemma 1]. Since in separated linear form all non-linear constraints are atomic, we will call them just *non-linear constraints*.

Let $\alpha : V \to \mathbb{Q}$ be a partial variable assignment. The interpretation $[\![\mathbf{x}]\!]^\alpha$ of a vector of variables $\mathbf{x}$ under $\alpha$ is defined in the standard way as componentwise application of $\alpha$. Define the notation $[\![t]\!]^\alpha$ as the evaluation of a term $t$ under the assignment $\alpha$, which can be partial, in which case $[\![t]\!]^\alpha$ is treated symbolically. We extend $[\![\cdot]\!]^\alpha$ to predicates, clauses and CNF in the usual way, and true, false denote the constants of the Boolean domain. The evaluation $[\![t \diamond 0]\!]^\alpha$ for a predicate $\diamond$ and a term $t$ results in true or false only if all variables in $t$ are assigned by $\alpha$.

In order to formally restate the calculus, the notions of linear resolvent and linearisation are essential. A resolvent $R_{\alpha, L, z}$ on a variable $z$ is a set of linear constraints that do not contain $z$, are implied by the formula $L$, and evaluate to false under the current partial assignment $\alpha$; for more details see [19,7].

**Definition 2.** *Let $P$ be a non-linear constraint and let $\alpha$ be an assignment with $[\![P]\!]^\alpha = \text{false}$. A* linearisation of $P$ at $\alpha$ *is a linear clause $C$ with the properties:*

1. $\forall \beta : [\![P]\!]^\beta = \text{true} \implies [\![C]\!]^\beta = \text{true}$, *and*
2. $[\![C]\!]^\alpha = \text{false}$.

W.l.o.g. we can assume that the variables of $C$ are a subset of the variables of $P$. Let us note that any linear clause $C$ represents the complement of a rational polytope $R$, and we will use both interchangeably. Thus for a rational polytope $R$, $\mathbf{x} \notin R$ also stands for a linear clause. In particular, any linearisation excludes a rational polytope containing the conflicting assignment from the search space.
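The two defining properties can be spot-checked on a toy instance. Take the non-linear constraint $P : x \cdot x \le 2$ and the conflicting assignment $\alpha(x) = 2$; the clause $C : (x \le 3/2 \vee x \ge 5/2)$, i.e. the complement of the box $B(2, 1/2)$, is a linearisation of $P$ at $\alpha$. The encoding as Python predicates is our illustration, not the paper's implementation:

```python
from fractions import Fraction

P = lambda x: x * x <= 2                                  # non-linear constraint
C = lambda x: x <= Fraction(3, 2) or x >= Fraction(5, 2)  # candidate linear clause
alpha = Fraction(2)                                       # conflicting assignment

assert not P(alpha)      # alpha is indeed a conflict for P
assert not C(alpha)      # property 2: C is false under alpha
# property 1 (sampled): every model of P satisfies C,
# since x*x <= 2 implies x <= sqrt(2) < 3/2
for k in range(-400, 401):
    beta = Fraction(k, 100)
    if P(beta):
        assert C(beta)
```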

*Transition rules.* For a formula $L_0 \cup N$ in separated linear form, the initial ksmt state is $(\mathrm{nil}, L_0, N)$. The calculus consists of the following transition rules from a state $S = (\alpha, L, N)$ to $S'$:


A path (or a run) is a derivation in the ksmt calculus. A procedure is an effective (possibly non-deterministic) way to construct a path.

*Termination.* If no transition rule is applicable, the derivation terminates. For clarity, we added the explicit rules (Fsat) and (Funsat), which lead to the final states. This calculus is sound [7, Lemma 2]: if the final transition is (Fsat), then α is a solution to the original formula; if it is (Funsat), then a trivial contradiction 0 > 1 was derived and the original formula is unsatisfiable. The calculus also makes progress by reducing the search space [7, Lemma 3].


Fig. 2. unsat example run of ksmt using interval linearisation [7].

An example run of the ksmt calculus is presented in Figure 2. We start in a state with a non-linear part $N = \{y \le 1/x\}$, which defines the pink area, and the linear part $L = \{(x/4 + 1 \le y), (y \le 4 \cdot (x - 1))\}$, shaded in green. Then we successively apply ksmt rules, excluding regions around candidate solutions by linearisations, until we derive linearisations which separate the pink area from the green area, thus deriving a contradiction.

*Remark 1.* In general a derivation may not terminate. The only cause of non-termination is the linearisation rule, which adds new linear constraints and can be applied infinitely many times. To see this, observe that ksmt with only the rules (A), (R), (B) corresponds to the conflict resolution calculus, which is known to be terminating [19,20]. Thus, in infinite ksmt runs the linearisation rule (L) is applied infinitely often. This argument is used in the proof of Theorem 1 below. Let us note that during a ksmt run neither conflicts nor lemmas can be generated more than once. In fact, any generated linearisation is not implied by the linear part prior to adding this linearisation.

#### 3.1 Sufficient Termination Conditions

In this section we will assume that $(\alpha, L, N)$ is a ksmt state obtained by applying ksmt inference rules to an initial state. As in [13] we only consider bounded instances. In many applications this is a natural assumption, as variables usually range within some (possibly large) bounds. We can assume that these bounds are made explicit as linear constraints in the system.

**Definition 3.** *Let $F$ be the formula $L_0 \wedge N$ in separated linear form over variables $x_1, \ldots, x_n$ and let $B_i$ be the set defined by the conjunction of all clauses in $L_0$ univariate in $x_i$, for $i = 1, \ldots, n$; in particular, if there are no univariate linear constraints over $x_i$ then $B_i = \mathbb{R}$. We call $F$ a* bounded instance *if:*

– *the set $B_i$ is bounded, for each $i = 1, \ldots, n$.*


By this definition, already the linear part of bounded instances explicitly defines a bounded set by univariate constraints. Consequently, the set of solutions of F is bounded as well.

In Theorem 1 we show that when we consider bounded instances and restrict linearisations to so-called ε-full linearisations, the procedure terminates. We use this to show that the ksmt-based decision procedure we introduce in Section 5 is δ-complete.

**Definition 4.** *Let $\varepsilon > 0$, let $P$ be a non-linear constraint over variables $\mathbf{x}$ and let $\alpha$ be an assignment of $\mathbf{x}$. A linearisation $C$ of $P$ at $\alpha$ is called* ε-full *iff for all assignments $\beta$ of $\mathbf{x}$ with $[\![\mathbf{x}]\!]^\beta \in B([\![\mathbf{x}]\!]^\alpha, \varepsilon)$, $[\![C]\!]^\beta = \text{false}$.*

*A ksmt run is called* ε-full *for some $\varepsilon > 0$ if all but finitely many linearisations in this run are ε-full.*

The next theorem provides a basis for termination of ksmt-based decision procedures for satisfiability.

**Theorem 1.** *Let $\varepsilon > 0$. On bounded instances, ε-full ksmt runs are terminating.*

*Proof.* Let $F : L_0 \wedge N$ be a bounded instance and $\varepsilon > 0$. Towards a contradiction, assume there is an infinite ε-full derivation $(\alpha_0, L_0, N), \ldots, (\alpha_n, L_n, N), \ldots$ in the ksmt calculus. Then, by definition of the transition rules, $L_k \subseteq L_l$ for all $k, l$ with $0 \le k \le l$. According to Remark 1, in any infinite derivation the linearisation rule must be applied infinitely many times. During any run of ksmt the set of non-linear constraints $N$ is fixed and therefore there is a non-linear constraint $P$ in $N$ over variables $\mathbf{x}$ to which linearisation is applied infinitely often. Let $(\alpha_{i_1}, L_{i_1}, N), \ldots, (\alpha_{i_n}, L_{i_n}, N), \ldots$ be a corresponding subsequence of the derivation such that $C_{i_1} \in L_{i_1+1}, \ldots, C_{i_n} \in L_{i_n+1}, \ldots$ are ε-full linearisations of $P$. Consider two different linearisation steps $k, \ell \in \{i_j : j \in \mathbb{N}\}$ in the derivation, where $k < \ell$. By the precondition of rule (L) applied in step $\ell$ we have $[\![L_\ell]\!]^{\alpha_\ell} \neq \text{false}$. In particular, the linearisation $C_k \in L_{k+1} \subseteq L_\ell$ of $P$ constructed in step $k$ does not evaluate to false under $\alpha_\ell$. Since the set of variables in $C_k$ is a subset of those in $P$, $[\![C_k]\!]^{\alpha_\ell} \neq \text{false}$ implies $[\![C_k]\!]^{\alpha_\ell} = \text{true}$. By assumption, the linearisation $C_k$ is ε-full, thus from Definition 4 it follows that $[\![\mathbf{x}]\!]^{\alpha_\ell} \notin B([\![\mathbf{x}]\!]^{\alpha_k}, \varepsilon)$. Therefore the distance between $[\![\mathbf{x}]\!]^{\alpha_k}$ and $[\![\mathbf{x}]\!]^{\alpha_\ell}$ is at least $\varepsilon$. However, every conflict satisfies the variable bounds defining $D_F$, so there can be only finitely many conflicts with pairwise distance at least $\varepsilon$. This contradicts the above. □

Concrete algorithms to compute ε-full linearisations are presented in Sections 5 and 6.

Fig. 3. The overlapping cases in the δ-SMT problem $f(x) \le 0$.

## 4 *δ*-decidability

In the last section we proved termination of the ksmt calculus on bounded instances when linearisations are ε-full. Let us now investigate how ε-full linearisations of constraints involving non-linear computable functions can be constructed. To that end, we assume that all non-linear functions are defined on the closure of the bounded space $D_F$ defined by the bounded instance $F$.

So far we have described an approach which gives exact results but at the same time is necessarily incomplete due to the undecidability of non-linear constraints in general. On the other hand, non-linear constraints can usually be approximated using numerical methods, allowing one to obtain approximate solutions to the problem. This gives rise to the bounded δ-SMT problem [13], which allows an overlap between the properties δ-sat and unsat of formulas, as illustrated by Figure 3. It is precisely this overlap that enables δ-decidability of bounded instances.

Let us recall the notion of δ-decidability, adapted from [13].

**Definition 5.** *Let $F$ be a formula in separated linear form and let $\delta \in \mathbb{Q}_{>0}$. We inductively define the* δ-weakening $F_\delta$ *of $F$.*

– *If $F$ is linear, let $F_\delta := F$.*
– *If $F$ is a non-linear constraint $f(\mathbf{x}) \diamond 0$, let*

$$F_\delta := \begin{cases} f(\mathbf{x}) - \delta \diamond 0, & \text{if } \diamond \in \{<, \le\}\\ f(\mathbf{x}) + \delta \diamond 0, & \text{if } \diamond \in \{>, \ge\}\\ |f(\mathbf{x})| - \delta \le 0, & \text{if } \diamond \in \{=\}\\ (f(\mathbf{x}) < 0 \vee f(\mathbf{x}) > 0)_\delta, & \text{if } \diamond \in \{\neq\}. \end{cases}$$

– *Otherwise, $F$ is $A \circ B$ with $\circ \in \{\wedge, \vee\}$. Let $F_\delta := (A_\delta \circ B_\delta)$.*

δ-deciding F *designates computing*

$$\begin{cases} \mathsf{unsat}, & \text{if } [\![F]\!]^\alpha = \text{false} \text{ for all } \alpha\\ \delta\text{-}\mathsf{sat}, & \text{if } [\![F_\delta]\!]^\alpha = \text{true} \text{ for some } \alpha. \end{cases}$$

*In case both answers are valid, the algorithm may output any.*

*An assignment $\alpha$ with $[\![F_\delta]\!]^\alpha = \text{true}$ is called a* δ-satisfying assignment *for $F$.*
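The δ-weakening of atomic constraints is mechanical to implement; a sketch under our own `(f, op, delta)` encoding of a constraint $f(\mathbf{x}) \diamond 0$ (the symbolic clause structure of the paper is elided):

```python
import math

def delta_weaken(f, op, delta):
    # Return a predicate deciding the delta-weakening of f(x) op 0.
    assert delta > 0
    if op == "<":  return lambda x: f(x) - delta < 0
    if op == "<=": return lambda x: f(x) - delta <= 0
    if op == ">":  return lambda x: f(x) + delta > 0
    if op == ">=": return lambda x: f(x) + delta >= 0
    if op == "=":  return lambda x: abs(f(x)) - delta <= 0
    if op == "!=": return lambda x: True   # Remark 2: always a tautology
    raise ValueError(op)

Fd = delta_weaken(math.sin, "=", 0.01)   # weakening of sin(x) = 0
assert Fd(3.1415)                        # |sin(3.1415)| <= 0.01: delta-satisfied
assert not Fd(1.0)                       # |sin(1.0)| ~ 0.84 > 0.01
```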

For non-linear constraints $P$ this definition of the δ-weakening $P_\delta$ corresponds exactly to the notion of δ-weakening $P^{-\delta}$ used in the introduction of δ-decidability [14, Definition 4.1].

*Remark 2.* The δ-weakening of a non-linear constraint $f(\mathbf{x}) \neq 0$ is a tautology.


We now consider the problem of δ-deciding quantifier-free formulas in separated linear form. The notion of δ-decidability is slightly stronger than in [13] in the sense that we do not weaken linear constraints. Consider a formula $F$ in separated linear form. As before, we assume the variables $\mathbf{x}$ to be bounded by linear constraints $\mathbf{x} \in D_F$. We additionally assume that for all non-linear constraints $P : f(\mathbf{x}) \diamond 0$ in $N$, $f$ is defined on $\bar{D}_P$. In order to simplify the presentation, throughout the rest of the paper we will assume that only the predicates $\diamond \in \{>, \ge\}$ are part of formulas, since the remaining ones $<, \le, =$ can easily be expressed by the former using simple arithmetic transformations, and by Remark 2 predicates $\neq$ are irrelevant for δ-deciding formulas.

An algorithm is δ*-complete* if it δ-decides bounded instances [13].

## 5 *δ*-ksmt

Since δ-decidability as introduced above relaxes the condition under which a formula is considered satisfied to δ-sat, this condition has to be reflected in the calculus, which we show solves the bounded δ-SMT problem in this section. Adding the following rule (Fsat<sup>δ</sup>) together with the new final state δ-sat to ksmt relaxes the termination conditions and turns it into the extended calculus we call δ-ksmt.

(Fsat<sup>δ</sup>) *Final δ-sat.* If $(\alpha, L, N)$ is a δ-ksmt state where $\alpha$ is a total assignment and $[\![L \wedge N_\delta]\!]^\alpha = \text{true}$, transition to the δ-sat state.

The applicability conditions of the rules (L) and (Fsat<sup>δ</sup>) are individually not decidable [27,5]; however, when we compute them simultaneously, we can effectively apply one of these rules, as we will show in Lemma 3. In combination with the ε-fullness of the computed linearisations (Lemma 4), this leads to Theorem 3, showing that δ-ksmt is a δ-complete decision procedure.

Let us note that if we assumed δ = 0, then δ-ksmt would just reduce to ksmt, as (Fsat) and (Fsat<sup>δ</sup>) become indistinguishable; in the following, however, we always assume δ > 0.

In the following subsection we prove that terminating derivations of the δ-ksmt calculus lead to correct results. Then, in Section 5.2, we present a concrete algorithm for applying the rules (L) and (Fsat<sup>δ</sup>) and show its linearisations to be ε-full, which is sufficient to ensure termination, as shown in Theorem 1. These properties lead to a δ-complete decision procedure. In Section 6 we develop a more practical algorithm for ε-full linearisations that does not require computing a uniform modulus of continuity.

#### 5.1 Soundness

In this section we show soundness of the δ-ksmt calculus, that is, validity of its derivations. In particular, this implies that derivability of the final states unsat, δ-sat and sat directly corresponds to unsatisfiability, δ-satisfiability and satisfiability of the original formula, respectively.

**Lemma 1.** *For all δ-ksmt derivations of $S' = (\alpha', L', N)$ from a state $S = (\alpha, L, N)$ and for all total assignments $\beta$, $[\![L \wedge N]\!]^\beta = [\![L' \wedge N]\!]^\beta$.*

*Proof.* Let $\beta$ be a total assignment of the variables in $L \wedge N$. Since the set of variables remains unchanged by δ-ksmt derivations, $\beta$ is a total assignment for $L' \wedge N$ as well. Let $S' = (\alpha', L', N)$ be derived from $S = (\alpha, L, N)$ by a single application of one of the δ-ksmt rules. By the structure of $S'$, its derivation was caused by neither (Funsat), (Fsat) nor (Fsat<sup>δ</sup>). For the rules (A) and (B) there is nothing to show since $L = L'$. If (R) caused $S \to S'$, the claim holds by soundness of arithmetical resolution. Otherwise (L) caused $S \to S'$, in which case the direction $\Rightarrow$ follows from the definition of a linearisation (condition 1 in Definition 2), while the other direction trivially holds since $L \subseteq L'$.

The condition on derivations of arbitrary lengths then follows by induction.

**Lemma 2.** *Let $\delta \in \mathbb{Q}_{>0}$. Consider a formula $G = L_0 \wedge N$ in separated linear form and let $S = (\alpha, L, N)$ be a δ-ksmt state derivable from the initial state $S_0 = (\mathrm{nil}, L_0, N)$. The following hold.*


*Proof.* Let formula G and states S0, S be as in the premise. As S is not final in δ-ksmt, only ksmt rules have been applied in deriving it. The statements for rules (Funsat) and (Fsat) thus hold by soundness of ksmt [7, Lemma 2].

Assume (Fsat<sup>δ</sup>) is applicable to $S$, that is, $[\![L \wedge N_\delta]\!]^\alpha$ is true. Then, since $L_0 \subseteq L$, we conclude that $\alpha$ satisfies $L_0 \wedge N_\delta$ which, according to Definition 5, equals $G_\delta$. Therefore $\alpha$ is a δ-satisfying assignment for $G$. □

Since the only way to derive one of the final states unsat, δ-sat and sat from the initial state in δ-ksmt is by application of the rules (Funsat), (Fsat<sup>δ</sup>) and (Fsat), respectively, as a corollary of Lemmas 1 and 2 we obtain soundness.

**Theorem 2 (Soundness).** *Let $\delta \in \mathbb{Q}_{>0}$. The δ-ksmt calculus is sound.*

#### 5.2 *δ*-completeness

We proceed by introducing Algorithm 1, which computes linearisations and decides which of the rules (Fsat<sup>δ</sup>) and (L) to apply. These linearisations are then shown to be ε-full for some $\varepsilon > 0$ depending on the bounded instance. By Theorem 1, this property implies termination, showing that δ-ksmt is a δ-complete decision procedure.

Given a non-final δ-ksmt state, the function nlinStep<sub>δ</sub> in Algorithm 1 computes a δ-ksmt state derivable from it by application of (Fsat<sup>δ</sup>) or (L). This is done by evaluating the non-linear functions and adding a linearisation based on their uniform moduli of continuity as needed. To simplify the algorithm, it assumes total assignments as input. It is possible to relax this requirement, e.g., by invoking the rules (A) or (R) instead of returning δ-sat for partial assignments.

**Algorithm 1 (nlinStep<sub>δ</sub>).** Algorithm computing a δ-ksmt derivation according to either rule (L) or (Fsat<sup>δ</sup>) from a state $(\alpha, L, N)$ where $\alpha$ is total. The functions $f$ are assumed to be computed by machines $M^?_f$ and $\mu_f$ to be a computable uniform modulus of continuity of $f$.
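The structure of nlinStep<sub>δ</sub> can be paraphrased as follows. The data representation, the approximation oracle `approx`, the modulus `mu` and the concrete thresholds are our simplifications of the paper's algorithm, which operates on function-oracle machines directly:

```python
import math
from fractions import Fraction

def nlin_step(alpha, L, N, delta, approx, mu):
    # One step applying (L) or (Fsat-delta). N is a list of constraints
    # (f_name, op) with op in {'>', '>='}; approx(f_name, alpha, p) returns
    # a rational 2**-p-approximation of f at alpha; mu(f_name, p) is a
    # uniform modulus of continuity exponent for f.
    p = max(0, math.ceil(-math.log2(min(1, delta / 4))))   # 2**-p <= delta/4
    for f_name, op in N:
        y = approx(f_name, alpha, p)      # |y - f(alpha)| <= delta/4
        if y <= -Fraction(delta) / 2:     # constraint falsified with margin
            eps = Fraction(2) ** -mu(f_name, p)
            # rule (L): add the clause  x not in B(alpha, eps)
            return ("L", (f_name, alpha, eps))
    # all delta-weakened constraints hold: rule (Fsat-delta)
    return ("Fsat_delta", alpha)

# toy instance: x^2 - 2 > 0, evaluated exactly; mu chosen for |x| <= 4
approx = lambda f_name, a, p: Fraction(a["x"]) ** 2 - 2
mu = lambda f_name, p: p + 3              # x^2 is 8-Lipschitz on |x| <= 4
N = [("x^2-2", ">")]
assert nlin_step({"x": 1}, [], N, 0.5, approx, mu)[0] == "L"
assert nlin_step({"x": 2}, [], N, 0.5, approx, mu)[0] == "Fsat_delta"
```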


**Lemma 3.** *Let $\delta \in \mathbb{Q}_{>0}$ and let $S = (\alpha, L, N)$ be a δ-ksmt state where $\alpha$ is total and $[\![L]\!]^\alpha = \text{true}$. Then nlinStep<sub>δ</sub>$(\alpha, L, N)$ computes a state derivable by application of either (L) or (Fsat<sup>δ</sup>) to $S$.*

*Proof.* In the proof we will use notions from computable analysis, as defined in Section 2. Let $(\alpha, L, N)$ be a state as in the premise and let $P : f(\mathbf{x}) \diamond 0$ be a non-linear constraint in $N$. Let $M^?_f$ compute $f$ as in Algorithm 1. The algorithm computes a rational approximation $\tilde{y} = M^{\varphi}_f(p)$ of $f([\![\mathbf{x}]\!]^\alpha)$, where $\varphi$ is a name for $[\![\mathbf{x}]\!]^\alpha$ and $p \ge -\log_2(\min\{1, \delta/4\})$, $p \in \mathbb{N}$. $[\![L]\!]^\alpha = \text{true}$ implies $[\![\mathbf{x}]\!]^\alpha \in D_P \subseteq \operatorname{dom} f$, thus the computation of $\tilde{y}$ terminates. Since $M^?_f$ computes $f$, $\tilde{y}$ is accurate up to $2^{-p} \le \delta/4$, that is, $\tilde{y} \in [f([\![\mathbf{x}]\!]^\alpha) \pm \delta/4]$. By assumption $\diamond \in \{>, \ge\}$, thus


For Item 1 no linearisation is necessary and indeed the algorithm does not linearise P. Otherwise (Item 2), it adds the linearisation (*x* ∉ B(*x*<sub>α</sub>, ε)) to the linear clauses. Since *x*<sub>α</sub> ∈ D<sup>P</sup>, by Eq. (2.1) we obtain that 0 ∉ B(f(*z*), δ/4) holds, implying ¬(f(*z*) ◦ 0), for all *z* ∈ B(*x*<sub>α</sub>, ε) ∩ D̄<sup>P</sup>. Hence, (*x* ∉ B(*x*<sub>α</sub>, ε)) is a linearisation of P at α.

In case nlinStep<sup>δ</sup>(α, L, N) returns δ-sat, the premise of Item 1 holds for every non-linear constraint in N, that is, ⟦N<sup>δ</sup>⟧<sup>α</sup> = true. By assumption ⟦L⟧<sup>α</sup> = true, hence the application of the (F<sub>sat</sub><sup>δ</sup>) rule deriving δ-sat is possible in δ-ksmt.

Lemma 4. *For any bounded instance* L<sub>0</sub> ∧ N *there is a computable* ε ∈ Q<sup>>0</sup> *such that any* δ*-*ksmt *run starting in* (nil, L<sub>0</sub>, N)*, where applications of* (L) *and* (F<sub>sat</sub><sup>δ</sup>) *are performed by* nlinStep<sup>δ</sup>*, is* ε*-full.*

*Proof.* Let P : f(*x*) ◦ 0 be a non-linear constraint in N. Since L<sub>0</sub> ∧ N is a bounded instance, D<sup>P</sup> ⊆ R<sup>n</sup> is also bounded. Let ε<sub>P</sub> := 2<sup>−μ<sub>f</sub>(p)</sup> where p ≥ −log<sub>2</sub>(min{1, δ/4}) ∈ N as in Algorithm 1. As μ<sub>f</sub> is a uniform modulus of continuity, the inequalities in the following construction hold on the whole domain D̄<sup>P</sup> of f and do not depend on the concrete assignment α where the linearisation is performed. Since log<sub>2</sub> and μ<sub>f</sub> are computable, so are p and ε<sub>P</sub>. There are finitely many non-linear constraints P in N, therefore the linearisations the algorithm nlinStep<sup>δ</sup> computes are ε-full with ε = min{ε<sub>P</sub> : P in N} > 0.

We call δ-ksmt derivations in which linearisations are computed using Algorithm 1 δ-ksmt with full-box linearisations, or δ*-*ksmt*-fb* for short. As the runs it computes are ε-full for some ε > 0, by Theorem 1 they terminate.

Theorem 3. δ*-*ksmt*-fb is a* δ*-complete decision procedure.*

*Proof.* δ-ksmt-fb is sound (Theorem 2) and terminates on bounded instances (Theorem 1 and Lemma 4).

## 6 Local ε-full Linearisations

In practice, when implementing the algorithm computing ε-full linearisations described in the previous section, the question arises of how to obtain a good uniform modulus of continuity μ<sub>f</sub> for a computable function f. Depending on how f is given, there may be several ways of computing it. Implementations of exact real arithmetic, e.g., iRRAM [24] and Ariadne [2], are usually based on the formalism of function-oracle Turing machines (see Definition 1), which allows computing with representations of computable functions [10], including implicit representations of functions as solutions of ODEs/PDEs [26,9]. If f is only available as a function-oracle Turing machine M<sup>?</sup><sub>f</sub> computing it, a modulus μ<sub>f</sub> valid on a compact domain can be computed; in general, however, this is not possible without exploring the behaviour of the function on the whole domain, which in many cases is computationally expensive. Moreover, since μ<sub>f</sub> is uniform, μ<sub>f</sub>(n) is constant throughout D<sup>P</sup>, independent of the actual assignment α determining where f is evaluated. Yet, computable functions admit *local* moduli of continuity that additionally depend on the concrete point in their domain. In most cases these provide linearisations with larger ε than that determined by μ<sub>f</sub>, leading to larger regions being excluded, ultimately resulting in fewer linearisation steps and a general speed-up. Indeed, machines producing finite approximations of f(x) from finite approximations of x internally have to compute some form of local modulus to guarantee correctness. In this section, we explore this approach of obtaining linearisations covering a larger part of the function's domain.
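As a toy illustration of this point (our own example, not from the paper), consider f(x) = x² on [−8, 8]: a uniform modulus must account for the steepest part of the domain, while a local modulus at a point x only needs the slope near x, yielding a strictly larger exclusion radius 2<sup>−γ(p)</sup> near small x.

```python
import math

BOUND = 8  # domain [-BOUND, BOUND]

def uniform_modulus_sq(p):
    # |x^2 - y^2| = |x + y| |x - y| <= 2*BOUND*|x - y| on the whole domain,
    # so |x - y| <= 2^-(p + log2(2*BOUND)) guarantees |x^2 - y^2| <= 2^-p
    return p + math.ceil(math.log2(2 * BOUND))

def local_modulus_sq(p, x):
    # near x (within distance <= 1 <= 2^0): |x + y| <= 2|x| + 2, hence this suffices
    return p + max(0, math.ceil(math.log2(2 * abs(x) + 2)))
```

At x = 0.1 and accuracy p = 3, the local modulus gives exclusion radius 2⁻⁵ versus the uniform 2⁻⁷, i.e., a ball four times as large.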

In order to guarantee a positive bound on the local modulus of continuity extracted directly from the run of the machine M<sup>?</sup><sub>f</sub> computing f, it is necessary to restrict the names of real numbers M<sup>?</sup><sub>f</sub> computes on. The set of names should, in a very precise sense, be "small", i.e., it has to be compact. The very general notion of names used in Definition 1 is too broad to satisfy this criterion since the space of rational approximations is not even locally compact. Here, we present an approach using practical names of real numbers as sequences of dyadic rationals whose lengths are restricted by the accuracy. For that purpose, we introduce another representation [31] of R, that is, the surjective mapping ξ : D<sub>ω</sub> → R. Here, D<sub>ω</sub> denotes the set of infinite sequences ϕ of dyadic rationals with bounded length. If ϕ has a limit (in R), we write lim ϕ.

Definition 6. *For* k ∈ ω *let* D<sub>k</sub> := Z · 2<sup>−(k+1)</sup> = {m/2<sup>k+1</sup> : m ∈ Z} ⊂ Q *and let* D<sub>ω</sub> := ⨉<sub>k∈ω</sub> D<sub>k</sub> *be the set of all sequences* (ϕ<sub>k</sub>)<sub>k</sub> *with* ϕ<sub>k</sub> ∈ D<sub>k</sub> *for all* k ∈ ω*. By default,* D<sub>ω</sub> *is endowed with the Baire space topology, which corresponds to that induced by the metric*

$$d: (\varphi, \psi) \mapsto \begin{cases} 0 & \text{if } \varphi = \psi\\ 1/\min\{1 + n : n \in \omega, \varphi_n \neq \psi_n\} & \text{otherwise}. \end{cases}$$
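The grids D<sub>k</sub> and the metric d can be made concrete with exact rational arithmetic. The sketch below is illustrative: `dyadic_name` rounds to the grid D<sub>k</sub> and yields just one of many valid ξ-name prefixes, and `baire_distance` evaluates d on finite prefixes of equal length.

```python
from fractions import Fraction

def dyadic_name(x, length):
    """Prefix of a xi-name of the rational x: entry k lies in D_k = Z * 2^-(k+1)."""
    return [Fraction(round(x * 2 ** (k + 1)), 2 ** (k + 1)) for k in range(length)]

def baire_distance(phi, psi):
    """The metric d from Definition 6, evaluated on finite prefixes of equal length."""
    for n, (a, b) in enumerate(zip(phi, psi)):
        if a != b:
            return Fraction(1, 1 + n)   # 1 / min{1 + n : phi_n != psi_n}
    return Fraction(0)                  # prefixes agree up to this length
```

For example, the names of 1/3 and 3/10 agree on their first two entries and differ at index 2, so their distance is 1/3.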


Using a standard product construction we can easily generalise the notion of ξ-names to ξ<sup>n</sup>-names of R<sup>n</sup>. When clear from the context, we will drop n and just write ξ to denote the corresponding generalised representation D<sub>ω</sub><sup>n</sup> → R<sup>n</sup>.

Computable equivalence between two representations not only implies that there are continuous maps between them but also that names can be computably transformed [31]. Since the Cauchy representation itself is continuous [4], we derive continuity of ξ, which is used below to show compactness of preimages ξ<sup>−1</sup>(X) of compact sets X ⊆ R under ξ. All proofs can be found in [8].

Lemma 5. *The following properties hold for* ξ*.*


The converse of Item 2 does not hold. An example of a Cauchy-name of 0 ∈ R is the sequence (x<sub>n</sub>)<sub>n</sub> with x<sub>n</sub> = (−2)<sup>−n</sup> for all n ∈ ω, which does not satisfy ∀i, j : |x<sub>i</sub> − x<sub>i+j</sub>| ≤ 2<sup>−(i+1)</sup>. However, given a name of a real number, we can compute a corresponding ξ-name; this is one direction of the property in Item 3.

As a consequence of Item 2, a function-oracle machine M<sup>?</sup> computing f : R<sup>n</sup> → R according to Definition 1 can be run on ξ-names of *x* ∈ R<sup>n</sup>, leading to valid Cauchy-names of f(*x*). Note that this proposition does not require M<sup>?</sup><sub>f</sub> to compute a ξ-name of f(*x*); any rational sequence rapidly converging to f(*x*) is a valid output. This means that the model of computation remains unchanged with respect to the earlier parts of this paper; it is only the set of names the machines are operated on that is restricted. This is reflected in Algorithm 2 by computing dyadic rational approximations *x̃*<sub>k</sub> of *x*<sub>α</sub> such that *x̃*<sub>k</sub> ∈ D<sub>k</sub><sup>n</sup>, instead of keeping the name of *x*<sub>α</sub> constant as has been done in Algorithm 1.

Algorithm 2 (Local linearisation) Algorithm δ-deciding P : f(*x*) ◦ 0 and – in case of unsat – computing a linearisation at α, or returning "None", in which case α satisfies P<sup>δ</sup>. The function f is computed by machine M<sup>?</sup><sub>f</sub>.


In particular, in Theorem 4 we show that linearisations for the (L<sup>δ</sup>) rule can be computed by Algorithm 2, which – in contrast to linearise<sup>δ</sup> in Algorithm 1 – does not require access to a procedure computing an upper bound μ<sub>f</sub> on the uniform modulus of continuity of the non-linear function f ∈ F<sub>nl</sub> valid on the entire bounded domain. It not only runs the machine M<sup>?</sup><sub>f</sub>, but also observes the queries M<sup>ϕ</sup><sub>f</sub> poses to its oracle in order to obtain a local modulus of continuity of f at the point of evaluation. The function approx(*x*, m) := ⌊*x* · 2<sup>m+1</sup>⌉/2<sup>m+1</sup> used to define Algorithm 2 computes a dyadic approximation of *x*, with ⌊·⌉ : Q<sup>n</sup> → Z<sup>n</sup> denoting a rounding operation, that is, one satisfying ∀*q* : ‖*q* − ⌊*q*⌉‖ ≤ 1/2. On rationals (our use-case), ⌊·⌉ is computable by a classical Turing machine.
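The function approx is directly implementable over exact rationals; a minimal sketch, componentwise on vectors, using Python's built-in `round` as the ⌊·⌉ operation (which satisfies |q − ⌊q⌉| ≤ 1/2):

```python
from fractions import Fraction

def approx(q, m):
    """Dyadic approximation of a rational vector q at accuracy level m.

    Each component of the result lies in D_m = Z * 2^-(m+1) and is
    within 2^-(m+2) of the corresponding component of q.
    """
    scale = 2 ** (m + 1)
    return tuple(Fraction(round(c * scale), scale) for c in q)
```

E.g. approx((1/3, −7/5), 3) yields (5/16, −11/8), each within 2⁻⁵ of the input.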

Definition 7 ([31, Definition 6.2.6]). *Let* f : R<sup>n</sup> → R *and* *x* ∈ dom f*. A function* γ : N → N *is called* a (local) modulus of continuity of f at *x* *if for all* p ∈ N *and* *y* ∈ dom f*,* ‖*x* − *y*‖ ≤ 2<sup>−γ(p)</sup> ⟹ |f(*x*) − f(*y*)| ≤ 2<sup>−p</sup> *holds.*

We note that in most cases a local modulus of continuity of f at *x* is smaller than the best uniform modulus of f on its domain, since it only depends on the local behaviour of f around *x*. One way of computing a local modulus of f at *x* is using the function-oracle machine M<sup>?</sup><sub>f</sub>, as defined next.

Definition 8. *Let* M<sup>?</sup><sub>f</sub> *compute* f : R<sup>n</sup> → R *and let* *x* ∈ dom f *have Cauchy-name* ϕ*. The function* γ<sub>M<sup>?</sup><sub>f</sub>,ϕ</sub> : p ↦ max{0, k : M<sup>ϕ</sup><sub>f</sub>(p + 2) *queries index* k *of* ϕ} *is called* the effective local modulus of continuity induced by M<sup>?</sup><sub>f</sub> at ϕ*.*

The effective local modulus of continuity of f at a name ϕ of *x* ∈ dom f indeed is a local modulus of continuity of f at *x* [17, Theorem 2.13]. Algorithm 2 computes ε-full linearisations by means of the effective local modulus [8], as stated next.
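Operationally, the effective local modulus can be read off a run by instrumenting the oracle, as in this sketch (the machine interface, a callable taking an oracle and an accuracy p, is our simplification of the function-oracle machines of Definition 1):

```python
class QueryTracker:
    """Oracle wrapper recording the largest index the machine queries."""
    def __init__(self, phi):
        self.phi = phi          # the underlying name: index k -> k-th approximation
        self.max_index = 0
    def __call__(self, k):
        self.max_index = max(self.max_index, k)
        return self.phi(k)

def effective_local_modulus(machine, phi, p):
    """gamma_{M,phi}(p): run M^phi(p + 2) and return the deepest oracle index used."""
    tracked = QueryTracker(phi)
    machine(tracked, p + 2)
    return tracked.max_index
```

For a machine computing f(x) = 2x by querying its oracle one level deeper than the requested accuracy, the effective local modulus at any name is p + 3.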

Lemma 6. *Let* P : f(*x*) ◦ 0 *be a non-linear constraint in* N *and* α *be an assignment of* *x* *to rationals in* dom f*. Whenever* C = lineariseLocal<sup>δ</sup>*(*f, *x*, ◦, α*) and* C ≠ None*,* C *is an* ε*-full linearisation of* P *at* α*, with* ε *corresponding to the effective local modulus of continuity induced by* M<sup>?</sup><sub>f</sub> *at a* ξ*-name of* *x*<sub>α</sub>*.*

Thus, the function lineariseLocal<sup>δ</sup> in Algorithm 2 is a drop-in replacement for linearise<sup>δ</sup> in Algorithm 1 since the condition for returning a linearisation of P versus accepting P<sup>δ</sup> is identical. The linearisations however differ in the radius ε, which now, according to Lemma 6, corresponds to the effective local modulus of continuity. We call the resulting procedure nlinStepLocal<sup>δ</sup>. One of its advantages over nlinStep<sup>δ</sup>, namely running M<sup>?</sup><sub>f</sub> on ξ-names instead of arbitrary Cauchy-names, is that for bounded instances the former constitute a compact set, unlike the latter. This allows us to bound ε > 0 for the computed ε-full local linearisations of otherwise arbitrary δ-ksmt runs. A proof of the following lemma showing compactness of preimages ξ<sup>−1</sup>(X) of compact sets X ⊆ R under ξ is given in [8].

Lemma 7. *Let* X ⊂ R<sup>n</sup> *be compact. Then the set* ξ<sup>−1</sup>(X) ⊂ D<sub>ω</sub><sup>n</sup> *of* ξ*-names of elements in* X *is compact as well.*

The proof involves showing ξ<sup>−1</sup>(X) to be closed and uses the fact that for each component ϕ<sub>k</sub> of names (ϕ<sub>k</sub>)<sub>k</sub> of *x* ∈ X there are just finitely many choices from D<sub>k</sub>, due to the restriction on the length of the dyadics. This is not the case for the Cauchy representation used in Definition 1, and it is the key to deriving the existence of a strictly positive lower bound on the ε-fullness of linearisations.

Theorem 4. *Let* δ ∈ Q<sup>>0</sup>*. For any bounded instance* L<sub>0</sub> ∧ N *there is* ε > 0 *such that any* δ*-*ksmt *run starting in* (nil, L<sub>0</sub>, N)*, where applications of* (L) *and* (F<sub>sat</sub><sup>δ</sup>) *are performed according to* nlinStepLocal<sup>δ</sup>*, is* ε*-full.*

*Proof.* Assume L<sub>0</sub> ∧ N is a bounded instance. Set ε := min{ε<sub>P</sub> : P ∈ N}, where ε<sub>P</sub> is defined as follows. Let P : f(*x*) ◦ 0 in N. Then the closure D̄<sup>P</sup> of the bounded set D<sup>P</sup> is compact. Let E be the set of ξ-names of elements of D̄<sup>P</sup> ⊆ dom f (see Definition 6) and for any ϕ ∈ E let k<sub>ϕ</sub> be defined as γ<sub>M<sup>?</sup><sub>f</sub>,ϕ</sub>(p) (see Definition 8), where p is computed from δ as in Algorithm 2 and is independent of ϕ. Since the preimage of each k<sub>ϕ</sub> is open, the function ϕ ↦ k<sub>ϕ</sub> is continuous. By Lemma 7 the set E is compact, thus there is ψ ∈ E such that 2<sup>−k<sub>ψ</sub></sup> = inf{2<sup>−k<sub>ϕ</sub></sup> : ϕ ∈ E}. Set ε<sub>P</sub> := 2<sup>−k<sub>ψ</sub></sup>. The claim then follows by Lemma 6.

Thus we can conclude.

Corollary 1. δ*-*ksmt *with local linearisations is a* δ*-complete decision procedure.*

## 7 Conclusion

In this paper we extended the ksmt calculus to the δ-satisfiability setting and proved that the resulting δ-ksmt calculus is a δ-complete decision procedure for solving non-linear constraints over computable functions, which include polynomials, exponentials, logarithms, trigonometric functions and many other functions used in applications. We presented algorithms for constructing ε-full linearisations, ensuring termination of δ-ksmt. Based on methods from computable analysis, we presented an algorithm for constructing local linearisations. Local linearisations exclude larger regions from the search space and can be used to avoid computationally expensive global analysis of non-linear functions.

#### References


D.T., Urban, J., Vu, K.K., Zumkeller, R.: A formal proof of the Kepler conjecture. CoRR abs/1501.02155 (2015)


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Universal Invariant Checking of Parametric Systems with Quantifier-free SMT Reasoning**

Alessandro Cimatti , Alberto Griggio , and Gianluca Redondi

Fondazione Bruno Kessler, Trento, Italy {cimatti, griggio, gredondi}@fbk.eu

**Abstract.** The problem of invariant checking in parametric systems – which are required to operate correctly regardless of the number and connections of their components – is gaining increasing importance in various sectors, such as communication protocols and control software. Such systems are typically modeled using quantified formulae, describing the behaviour of an unbounded number of (identical) components, and their automatic verification often relies on the use of decidable fragments of first-order logic in order to effectively deal with the challenges of quantified reasoning.

In this paper, we propose a fully automatic technique for invariant checking of parametric systems which does not rely on quantified reasoning. Parametric systems are modeled with array-based transition systems, and our method iteratively constructs a quantifier-free abstraction by analyzing, with SMT-based invariant checking algorithms for nonparametric systems, increasingly-larger finite instances of the parametric system. Depending on the verification result in the concrete instance, the abstraction is automatically refined by leveraging candidate lemmas from inductive invariants, or by discarding previously computed lemmas.

We implemented the method using a quantifier-free SMT-based IC3 as underlying verification engine. Our experimental evaluation demonstrates that the approach is competitive with the state of the art, solving several benchmarks that are out of reach for other tools.

**Keywords:** Parametric Systems · Array-based transitions systems · Abstraction-refinement · SMT

#### **1 Introduction**

Parametric systems consist of a finite but unbounded number of components. Examples include communication protocols (e.g. leader election), feature systems, or control algorithms in various application domains (e.g. railways interlocking logics). The key challenge is to prove the correctness of the parametric system for all possible configurations corresponding to instantiations of the parameters.

Parametric systems can be described as symbolic array-based transition systems [10], where the dependence on the configuration is expressed with first-order quantifiers in the initial condition and the transition relation of the model.

In this paper, we propose a fully automated approach for solving the universal invariant problem of array-based systems. The distinguishing feature is that the approach, grounded in SMT, does not require dealing with quantified theories, with obvious computational advantages. The algorithm implements an abstraction-refinement loop, where the abstract space is a quantifier-free transition system over some SMT theories. Our inspiration and starting point is the Parameter Abstraction of [3,15], which we extend in two directions. First, we modify the definition of the abstraction, by introducing a set of different *environment variables*, which intuitively overapproximate the behaviour of all the instances not precisely tracked by the abstraction, and by introducing a special *stuttering transition* in which the environment is allowed to change non-deterministically. Second, we combine the abstraction with a method for *automatically* inferring candidate universal lemmas, which are used to strengthen the abstraction in case of spurious counterexamples. The candidate lemmas are obtained by generalization from the spuriousness proof carried out in a finite-domain instantiation of the concrete system. However, we do not require quantified reasoning to prove that they universally hold; rather, the algorithm takes into account the fact that candidate lemmas may turn out not to be universally valid. In such cases, the method is able to automatically discover such bad lemmas and discard them, by examining increasingly-higher-dimension bounded instances of the parametric system.

We implemented the method in a tool called Lambda. At its core, Lambda leverages modern model checking approaches for quantifier-free infinite-state systems, i.e. the SMT-based approach of IC3 with implicit abstraction [4], in contrast to other approaches [19] where the abstract space is Boolean. In our experimental evaluation, we compared Lambda with the state-of-the-art tools MCMT [11] and Cubicle [7]. The results show the advantage of the approach, that is able to solve multiple benchmarks that are out of reach for its competitors.

The rest of the paper is structured as follows. In Section 2 we present some logical background, and in Section 3 we describe array-based systems. We give an informal overview of the algorithm in Section 4. In Section 5 we define the abstraction and state its formal properties. In Section 6 we discuss the approach to concretization and refinement, and we present the techniques for inferring candidate lemmas. We discuss the related work in Section 7, and we present our experimental evaluation in Section 8. Finally, in Section 9 we draw some conclusions and present directions for future work. For lack of space, the proofs of our theoretical results, as well as further details on our experiments, are reported in an extended technical report [5].

#### **2 Preliminaries**

Our setting is standard first order logic. A theory T in the SMT sense is a pair T = (Σ, C), where Σ is a first order signature and C is a class of models over Σ. A theory T is closed under substructure if its class C of structures is such that whenever M ∈ C and N is a substructure of M, then N ∈ C. We use the standard notions of Tarskian interpretation (assignment, model, satisfiability, validity, logical consequence). We refer to 0-arity predicates as Boolean variables, and to 0-arity uninterpreted functions as (theory) variables. A literal is an atom or its negation. A clause is a disjunction of literals. A formula is in conjunctive normal form (CNF) iff it is a conjunction of clauses. If x<sub>1</sub>, ..., x<sub>n</sub> are variables and φ is a formula, we may write φ(x<sub>1</sub>, ..., x<sub>n</sub>) to indicate that all the variables occurring free in φ are in x<sub>1</sub>, ..., x<sub>n</sub>.

If φ is a formula, t is a term and v is a variable which occurs free in φ, we write φ[v/t] for the substitution of every occurrence of v with t. If t and v are vectors of the same length, we write φ[v/t] for the simultaneous substitution of each v<sub>i</sub> with the corresponding term t<sub>i</sub>. We use an if-then-else notation for formulae: we write **if** φ<sub>1</sub> **then** ψ<sub>1</sub> **elif** φ<sub>2</sub> **then** ψ<sub>2</sub> **elif** ... **else** ψ<sub>n</sub> to denote the formula (φ<sub>1</sub> → ψ<sub>1</sub>) ∧ ((¬φ<sub>1</sub> ∧ φ<sub>2</sub>) → ψ<sub>2</sub>) ∧ ... ∧ ((¬φ<sub>1</sub> ∧ ... ∧ ¬φ<sub>n−1</sub>) → ψ<sub>n</sub>).
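The expansion of this if-then-else notation into guarded implications can be sketched as follows (formulas represented as nested tuples; purely illustrative, not the encoding used in the paper's implementation):

```python
def ite_formula(cases, default):
    """Expand if/elif/.../else into the conjunction of guarded implications.

    cases   -- list of (phi_i, psi_i) pairs
    default -- psi_n, the else branch
    Returns a list of (guards, body) clauses, each read as (/\ guards) -> body,
    where a clause's guards record that all earlier conditions failed.
    """
    clauses, failed = [], []
    for phi, psi in cases:
        clauses.append((failed + [phi], psi))   # not phi_1 /\ ... /\ phi_i -> psi_i
        failed = failed + [('not', phi)]
    clauses.append((failed, default))           # all conditions false -> psi_n
    return clauses
```

E.g. `ite_formula([('a', 'x'), ('b', 'y')], 'z')` produces three implications, the last one guarded by the negations of both conditions.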

Given a set of variables v, we denote with v′ the set {v′ | v ∈ v}. A symbolic transition system is a triple (v, I(v), T(v, v′)), where v is a set of variables, and I(v), T(v, v′) are first order formulae over some signature. An assignment to the variables in v is a state. A state s is initial iff it is a model of I(v), i.e. s |= I(v). The states s, s′ denote a transition iff s ∪ s′ |= T(v, v′), also written T(s, s′). A path is a sequence of states s<sub>0</sub>, s<sub>1</sub>, ... such that s<sub>0</sub> is initial and T(s<sub>i</sub>, s<sub>i+1</sub>) for all i. We denote paths with π, and with π[j] the j-th element of π. A state s is reachable iff there exists a path π such that π[i] = s for some i. A variable v is frozen iff for all π, i it holds that π[i](v) = π[0](v). In the following, when we define a frozen variable v, we assume that this is done by having a constraint v′ = v as a top-level conjunct of the transition formula. A formula φ(v) is an invariant of the transition system C = (v, I(v), T(v, v′)) iff it holds in all the reachable states. Following the standard model checking notation, we denote this with C |= φ(v).<sup>1</sup> A formula φ(v) is an inductive invariant for C iff I(v) |= φ(v) and φ(v) ∧ T(v, v′) |= φ(v′).
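For finite-state instances these notions are directly executable. A minimal explicit-state sketch (our illustration: states are hashable values, `transition` enumerates successors):

```python
def reachable_states(initial_states, transition):
    """All states reachable from initial_states under the successor relation."""
    seen = set(initial_states)
    frontier = list(initial_states)
    while frontier:
        s = frontier.pop()
        for t in transition(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def is_invariant(initial_states, transition, phi):
    """phi is an invariant iff it holds in every reachable state."""
    return all(phi(s) for s in reachable_states(initial_states, transition))
```

For a counter modulo 4 started at 0, v < 4 is an invariant while v < 3 is not.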

# **3 Modeling Parametric Systems as Array-based Transition Systems**

In order to describe parametric systems, we adapt from [10] the notion of array-based systems. In the following, we fix a theory of indexes T<sub>I</sub> = (Σ<sub>I</sub>, C<sub>I</sub>) and a theory of elements T<sub>E</sub> = (Σ<sub>E</sub>, C<sub>E</sub>). In order to model the parameters, we require that the class C<sub>I</sub> is closed under substructure. Then with A<sup>E</sup><sub>I</sub> we denote the theory whose signature is Σ = Σ<sub>I</sub> ∪ Σ<sub>E</sub> ∪ {[ ]}, and a model for it is given by a set of total functions from a model of T<sub>I</sub> to a model of T<sub>E</sub>. In general, we can have several array theories with multiple sorts for indexes and elements.

<sup>1</sup> Note that we use the symbol <sup>|</sup>= with three different denotations: if φ, ψ are formulae, φ |= ψ denotes that ψ is a logical consequence of φ; if μ is an interpretation, and ψ is a formula, μ |= ψ denotes that μ is a model of ψ; if C is a transition system, C |= ψ denotes that ψ is an invariant of C.

For simplicity, we fix only an index sort and an elem sort. In the following, an array-based transition system

$$C = (a, \iota(a), \tau(a, a'))$$

is a symbolic transition system, with the additional constraints that:


$$
\exists \underline{i} \forall \underline{j} . \psi(\underline{i}, \underline{j}, a[\underline{i}], a[\underline{j}], a'[\underline{i}], a'[\underline{j}]) .
$$

with ψ a quantifier-free formula.

This syntactic requirement subsumes the common guard-and-update formalism used for the description of parametric systems, as used e.g. in [10, 12, 15].

In the following, we shall refer to the disjuncts τ<sup>k</sup> of τ as *transition rules* (or simply *rules* when clear from the context).

An array-based transition system can be seen as a family of transition systems, one for each cardinality of the finite models M<sub>I</sub> of T<sub>I</sub>. In the following, given an integer d, we denote with C<sup>d</sup> *the finite instance of* C *of size* d, obtained by instantiating the quantifiers of C over a set of fresh index variables of cardinality d (considered implicitly different from each other). Note that this C<sup>d</sup> is a *symmetric presentation* [15]: if c = {c<sub>1</sub>, ..., c<sub>d</sub>} are the fresh index variables, and σ is a permutation of c, we have that, for every formula φ(c, a[c]), C<sup>d</sup> |= φ(c, a[c]) ⇔ C<sup>d</sup> |= φ(σ(c), a[σ(c)]).

*Example 1 (Mutex Protocol for Ring Topology).* Here we describe a simple protocol for accessing a shared resource, with processes in a ring-shaped topology. As an index theory, we use the finite sets of integers. As an element theory, we use both the Booleans and an enumerated data type of two elements, namely {idle, critical}. The array variable t, with sort index → boolean, is true at an index x iff the process at x holds the token. The array variable s, with sort index → {idle, critical}, holds the current state of the process. In addition, we have an integer frozen variable length, which represents the length of the ring. The transition system is described by the following formulae:

**Initial states.** Initially, only one process holds the token, and every process is idle. We model this initial process with an additional constant init token of sort index. Moreover, each index is bounded by the value of length. The initial formula is:

$$\forall j.\, s[j] = idle \land j \ge 1 \land j \le length \land length > 0$$

$$\land \begin{cases} \text{if } j = init.token \text{ then } t[j] = true \\ \text{else } t[j] = false \end{cases}$$

**Transition rule 1.** A process which holds the token can enter the critical section:

$$\begin{aligned} \exists i.\ & s[i] = idle \land t[i] = true \land s'[i] = critical \land t'[i] = t[i]\ \land \\ & \forall j \neq i.\ (s'[j] = s[j] \land t'[j] = t[j]) \end{aligned}$$

**Transition rule 2.** A process exits from the critical section and passes the token to the process at its right:

$$\exists i.\, s[i] = critical \land s'[i] = idle \land t'[i] = false\ \land$$

$$\forall j, j \neq i. \begin{cases} \text{if } j = 1 \land i = length \text{ then } s'[j] = s[j] \land t'[j] = true \\ \text{elif } j = i + 1 \land i < length \text{ then } s'[j] = s[j] \land t'[j] = true \\ \text{else } s'[j] = s[j] \land t'[j] = t[j] \end{cases}$$
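The two rules above can be executed directly on a finite instance. The following sketch (an illustrative explicit-state check of our own, not the symbolic encoding used by the tool) explores all reachable states of a ring of n processes and checks mutual exclusion:

```python
def ring_successors(state):
    """Successors under the two rules; state[i] = (s_i, t_i), 0-based ring indices."""
    n = len(state)
    succs = []
    for i in range(n):
        s, t = state[i]
        if s == 'idle' and t:                 # rule 1: the token holder enters
            new = list(state)
            new[i] = ('critical', True)
            succs.append(tuple(new))
        if s == 'critical':                   # rule 2: exit and pass token to the right
            new = list(state)
            new[i] = ('idle', False)
            j = (i + 1) % n                   # wraps around, like j = 1 when i = length
            new[j] = (new[j][0], True)
            succs.append(tuple(new))
    return succs

def mutual_exclusion(n):
    """Explore the instance of size n; check at most one process is ever critical."""
    init = tuple(('idle', i == 0) for i in range(n))  # process 0 plays init_token
    seen, frontier = {init}, [init]
    while frontier:
        st = frontier.pop()
        if sum(1 for s, _ in st if s == 'critical') > 1:
            return False
        for nx in ring_successors(st):
            if nx not in seen:
                seen.add(nx)
                frontier.append(nx)
    return True
```

Since the critical process keeps the unique token (rule 1) and releases it only on exit (rule 2), the check succeeds on every finite instance explored.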

#### **3.1 Universal invariant problem for array-based systems**

In the following, given an array-based transition system

$$C = (a, \iota(a), \tau(a, a')),$$

the *universal invariant problem* is the problem of proving (or disproving) that a formula of the form Φ def = ∀i.φ(i, a[i]) is an invariant for C.

**Guard Strengthening** In order to prove that ∀i.φ(i, a[i]) is an invariant of a system C = (a, ι(a), τ (a, a )), we can first strengthen the rules of C by adding the candidate invariant in conjunction with the transition relation, and then prove that the formula is an invariant of the newly-restricted system. This induction principle is justified by the following proposition:

**Proposition 1 (Guard strengthening [15])** *Let* C = (a, ι(a), τ(a, a′)) *be a transition system and let* Φ *be* ∀i.φ(i, a[i])*. Let* C<sub>Φ</sub> = (a, ι(a), τ(a, a′) ∧ Φ) *be the guard-strengthening of* C *with respect to* Φ*. Then, if* Φ *is an invariant of* C<sub>Φ</sub>*, it is also an invariant of* C*.*

**Prophecy variables** The universal quantifiers in the candidate invariant can be replaced with fresh frozen variables, called *prophecy variables*, that intuitively contain the indexes of the processes witnessing the violation of the property.

**Proposition 2 (Removing quantifiers [19])** *Let* C = (a, ι(a), τ(a, a′)) *be an array-based system. The formula* ∀i.φ(i, a[i]) *is an invariant for* C *iff the formula* φ(p, a[p]) *is an invariant for* C<sub>+p</sub> = (a ∪ p, ι(a), τ(a, a′))*, where* p *is a set of fresh frozen variables of index sort.*

For better readability, in the following we will omit the subscript +p. Moreover, we assume that the index variables universally quantified in the candidate invariant are considered to be different. This does not limit expressiveness, and simplifies our discourse. Therefore, the prophecy variables induced by a candidate invariant are considered to be *implicitly different*.

**Fig. 1.** An overview of the algorithm. C is an array-based transition system; Φ is a quantified candidate invariant; Ψ def = {ψ<sub>1</sub>, ..., ψ<sub>n</sub>} is the set of candidate lemmas; C<sub>Φ∧Ψ</sub> is a quantified transition system resulting from the strengthening of C; C̃<sub>Φ∧Ψ</sub> is a quantifier-free transition system.

#### **4 Overview of the Method**

In the following, let an array-based transition system C def = (a, ι(a), τ(a, a′)), and a candidate universal invariant Φ def = ∀i.φ(i, a[i]) for C be given.

We now summarize the algorithm that attempts to solve the universal invariant problem for C and Φ. The algorithm, depicted in Figure 1, iterates trying either to construct an abstraction sufficiently precise to prove the property (exit with Safe), or to find a finite instantiation of the problem exhibiting a concrete counterexample (exit with Unsafe). The abstract space is quantifier-free, and obtained by instantiating the universally quantified formulae over two sets of index variables: the prophecy variables, which arise from the candidate invariant (as explained in Proposition 2) and are denoted with p; and the *environmental* variables, denoted with x, which arise from the transition formula and are intended to represent the environment surrounding the p indexes, interacting with them in the behaviour leading to the violation. While prophecy variables are frozen, thus representing the same indexes for the whole run, environmental variables are free to change at each time step, hence producing possibly spurious behaviours. The algorithm maintains a set of *candidate lemmas* Ψ def = {Ψ<sub>i</sub>}<sub>i</sub>, composed of universally quantified formulae, that are used to strengthen the property and to tighten the abstraction. Initially, Ψ is empty. In the following, if C<sup>d</sup> is a finite instance of C and Φ is a candidate universal invariant, with Φ<sup>d</sup> we denote the formula obtained from Φ by instantiating the quantifiers over the variables used for the domain of cardinality d.

At each iteration, we carry out the following high-level steps (described in detail in the next sections):


When the algorithm terminates with Unsafe, we are able to exhibit a finite counterexample trace in a finite instance of C violating the property. When the algorithm terminates with Safe, the property holds in C. The result is obtained by the following chain of implications: from Theorem 3, stated in the next section, we have that C̃<sup>Φ∧Ψ</sup> |= Φ̃ ∧ Ψ̃ implies C<sup>Φ∧Ψ</sup> |= Φ̃ ∧ Ψ̃. From Proposition 2, we have that C<sup>Φ∧Ψ</sup> |= Φ ∧ Ψ. Therefore, from Proposition 1, we have C |= Φ ∧ Ψ. In particular, we have C |= Φ.
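The control flow described above (and detailed in the next sections) can be sketched as a loop. The three callbacks below are hypothetical stand-ins for the abstract model check, the bounded-instance check, and the lemma generalization step; the names and signatures are ours, not the tool's actual API.

```python
def verify(check_abstract, check_finite, generalize, num_prophecy):
    """Sketch of the abstraction-refinement loop.  Hypothetical callback
    signatures (not the actual implementation):
      check_abstract(lemmas) -> ("safe", None) or ("cex", trace)
      check_finite(d, lemmas) -> ("inv", clauses) or ("cex", bad_lemmas)
      generalize(clauses)     -> set of new candidate lemmas
    """
    lemmas = set()       # candidate lemma set Psi, initially empty
    d = num_prophecy     # smaller instances satisfy Phi vacuously
    while True:
        status, _ = check_abstract(lemmas)
        if status == "safe":
            return "Safe"                 # C |= Phi by the chain of implications
        status, result = check_finite(d, lemmas)
        if status == "inv":
            lemmas |= generalize(result)  # strengthen property, tighten abstraction
            d += 1                        # explore a larger instance next time
        elif result:                      # counterexample falsifies some lemmas
            lemmas -= set(result)         # drop them; d is unchanged
        else:
            return "Unsafe"               # concrete counterexample to Phi itself

# Toy run: the abstract check fails until a lemma is learned from the
# bounded instance (all three callbacks are stubs, for illustration only).
result = verify(
    lambda lemmas: ("safe", None) if lemmas else ("cex", None),
    lambda d, lemmas: ("inv", ["C1"]),
    lambda clauses: {"psi:" + c for c in clauses},
    num_prophecy=2,
)
```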

#### **5 Modified Parameter Abstraction**

We describe here our Parameter Abstraction. The first version of this approach was introduced in [3], and later formalized in [15]. In the following, we describe

<sup>2</sup> In the following, with Φ ∧ Ψ we denote the prenex form of Φ ∧ ⋀<sub>i</sub> ψ<sub>i</sub>.

a novel version of the abstraction, and how it can be applied to array-based transition systems. The main novelty is that, instead of using a special abstract index "∗" that overapproximates the behaviour of the system in the array locations that are not explicitly tracked, we use n *environmental (index) variables*, which are not abstracted but are allowed to change nondeterministically in some transitions. This is achieved by means of an additional **stuttering transition**: this rule allows the environmental variables to change value arbitrarily, while leaving the values of the array at the prophecy indexes unchanged.

#### **5.1 Abstraction Computation**

Let an array-based transition system C and a universal invariant Φ be given.<sup>3</sup> By conjoining Φ to the transition rules in C, we obtain C<sup>Φ</sup>, the guard strengthening of C with respect to Φ. Then, we define two sets of variables: the prophecy variables p, whose number is determined by Proposition 2, and the environmental variables x, whose number is determined by the greatest existential quantification depth in the transition rules of C<sup>Φ</sup>. While the prophecies are frozen variables, the interpretation of the environmental variables is not fixed. Moreover, we assume that the values taken by p and x are different. We now define C̃, the parameter abstraction of C.

**Initial formula** Let ι(a) be ∀i.φ(i, a[i]), the initial formula of C in prenex form, with φ(i, a[i]) quantifier-free. The initial formula of the abstract system is a quantifier-free first-order formula, denoted ι̃(p, a[p]), obtained by instantiating all the universal quantifiers in ι over the set of prophecy variables p.
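Instantiating a universal closure over a finite set of index terms is simply a conjunction over all tuples of those terms. A minimal sketch, with formulas represented as strings purely for illustration:

```python
from itertools import product

def instantiate(body, num_vars, terms):
    """Instantiate  forall i1..ik . body(i1,...,ik)  over a finite set of
    ground index terms: one conjunct per k-tuple of terms."""
    return [body(*tup) for tup in product(terms, repeat=num_vars)]

# A hypothetical initial formula  iota(a) = forall i . a[i] = 0,
# instantiated over the prophecy variables p1, p2:
init_tilde = instantiate(lambda i: f"a[{i}] = 0", 1, ["p1", "p2"])
# init_tilde is the list of conjuncts of the abstract initial formula
```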

**Transition formula** The transition formula of C<sup>Φ</sup> is still represented by a disjunction of formulae of the form<sup>4</sup>

$$
\tau(a, a') \quad \stackrel{\text{def}}{=} \quad \exists \underline{i}\,\forall \underline{j}.\ \psi(\underline{i}, \underline{j}, a[\underline{i}], a[\underline{j}], a'[\underline{i}], a'[\underline{j}]).
$$

For simplicity, we can assume that we have only one rule τ(a, a′). First, we compute the set of all substitutions of the variables i over p ∪ x, and we consider the set of formulae {τ̃<sub>j</sub>(p, x, a, a′)}, where j ranges over the substitutions and τ̃<sub>j</sub> is the result of applying the substitution to τ.

Then, for each formula in the set {τ̃<sub>j</sub>}, we instantiate the universal quantifiers over the set p ∪ x, obtaining a quantifier-free formula over prophecy and environmental variables.
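The two instantiation steps multiply out as follows: one abstract disjunct per substitution of the existential variables over p ∪ x, each containing one instance of the matrix per substitution of the universal variables. A counting sketch, where instances are represented simply by their index tuples:

```python
from itertools import product

def abstract_rule(ex_vars, univ_vars, prophecies, env):
    """Enumerate the quantifier-free instances of a rule
    exists i . forall j . psi:  one disjunct per substitution of the
    existential variables over p u x; within each disjunct, one
    conjunct per instantiation of the universal variables over p u x.
    Instances are represented by their index tuples, for illustration."""
    idx = prophecies + env
    disjuncts = []
    for ex_sub in product(idx, repeat=len(ex_vars)):
        conjuncts = [(ex_sub, un_sub)
                     for un_sub in product(idx, repeat=len(univ_vars))]
        disjuncts.append(conjuncts)
    return disjuncts

# The shape of Example 2: two prophecies, one environmental variable,
# one existential and one universal index variable in the rule.
d = abstract_rule(["i"], ["j"], ["p1", "p2"], ["x1"])
```

With two prophecies and one environmental variable, this indeed yields the three abstract transition formulae mentioned in Example 2.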

Moreover, we consider an additional transition formula, called the **stuttering transition**, defined by:

$$\tilde{\tau}\_S \stackrel{\text{def}}{=} \bigwedge\_{p \in \underline{p}} \left( a'[p] = a[p] \wedge p' = p \right)$$

<sup>3</sup> These represent the system and the property in input to each iteration of the loop.

<sup>4</sup> Possibly by performing trivial logical manipulations to distribute the guard strengthening inside the rules.

The disjunction of all the abstracted transition formulae is the transition formula τ̃. So, we can now define the transition system

> C̃ def= ({a, p, x}, ι̃(p, a[p]), τ̃(p, x, a[p ∪ x], a′[p ∪ x])).

*Example 2.* We apply the abstraction procedure to transition rule 2 of the token-ring protocol of Example 1.

Since the invariant is the formula ∀i, j.¬(s[i] = critical ∧ s[j] = critical), we have two prophecy variables p<sub>1</sub>, p<sub>2</sub>. Recall that the invariant itself is added to the transition as an additional conjunct. Since the existential quantification depth is one, we have only one environmental variable x<sub>1</sub>. In the abstracted system we obtain three transition formulae from the original transition; we report the one indexed by the substitution mapping i to x<sub>1</sub>; such a formula is equivalent to the following:

$$
\begin{array}{l}
s[x_1] = \mathit{critical} \wedge t[x_1] = \mathit{true} \wedge s'[x_1] = \mathit{idle} \wedge t'[x_1] = \mathit{false} \\
{} \wedge \displaystyle\bigwedge_{j \in \{p_1, p_2\}}
\left\{
\begin{array}{l}
\textbf{if } j = 1 \wedge x_1 = \mathit{length} \textbf{ then } s'[j] = s[j] \wedge t'[j] = \mathit{false} \\
\textbf{elif } j = x_1 + 1 \wedge x_1 < \mathit{length} \textbf{ then } s'[j] = s[j] \wedge t'[j] = \mathit{false} \\
\textbf{else } s'[j] = s[j] \wedge t'[j] = t[j]
\end{array}
\right. \\
{} \wedge \displaystyle\bigwedge_{\substack{i,j \in \{p_1, p_2, x_1\} \\ i \neq j}} \neg(s[i] = \mathit{critical} \wedge s[j] = \mathit{critical})
\end{array}
$$

#### **5.2 Stuttering Simulation**

We define here the stuttering simulation induced by our version of the Parameter Abstraction. The proof of the main theorem can be found in the appendix. The stuttering is induced by τ̃<sub>S</sub>: the resulting simulation is weaker than the one induced by [15], yet it is sufficient for preserving invariants.

**Definition 1 (Stuttering simulation)** *Given two symbolic transition systems* C<sub>1</sub> = (x<sub>1</sub>, ι<sub>1</sub>, τ<sub>1</sub>) *and* C<sub>2</sub> = (x<sub>2</sub>, ι<sub>2</sub>, τ<sub>2</sub>)*, with sets of states* S<sub>1</sub> *and* S<sub>2</sub>*, a stuttering simulation* S *is a relation* S ⊆ S<sub>1</sub> × S<sub>2</sub>*, such that:*


*If such a relation exists, we say that* C<sup>2</sup> *stutter simulates* C1*.*

We write S(s<sub>1</sub>) for {s<sub>2</sub> | (s<sub>1</sub>, s<sub>2</sub>) ∈ S}. We recall that stuttering simulation preserves reachability, i.e. if C<sub>2</sub> stutter simulates C<sub>1</sub> and s<sub>1</sub> is reachable in C<sub>1</sub>, then the set S(s<sub>1</sub>) is reachable in C<sub>2</sub>. Formally, the stuttering simulation induced by the Parameter Abstraction is defined as follows.

**Definition 2 (Simulation)** *Let* C *be the original transition system and let* C˜ *be its Parameter Abstraction. Let* s *and* s˜ *denote states of* C *and* C˜*, respectively. We define* S *as follows:*

$$\mathcal{S}(s,\tilde{s}) \;\text{ iff }\; s(a)[i] = \tilde{s}(a)[i] \;\text{ for all } i \in \bigcup\_{p \in \underline{p}} \tilde{s}(p).$$

Intuitively, we require that in the concrete state s and the abstract state s̃, the array is interpreted in the same way at all the locations referred to by the prophecy variables. We then have the following:

**Theorem 3.** *The relation* S *is a stuttering simulation between* C *and* C̃*. Moreover, if* C̃ |= Φ(p, a[p])*, then* C |= Φ(p, a[p])*.*

#### **6 Refinement**

If Φ(p, a[p]) does not hold in C̃, in general we cannot conclude anything, since the abstraction could be too coarse. So, if an abstract counterexample is encountered, we try to explore a small instance of the system to see whether this counterexample occurs in it. To choose the appropriate size, our algorithm keeps a counter d, whose value is equal to the size to explore. Initially, d is equal to the number of (universally quantified) index variables in the property Φ.<sup>5</sup> When an abstract counterexample is encountered, we check whether C<sub>d</sub> |= (Φ ∧ Ψ)<sub>d</sub>. For this check, we use a model checker able to return, in case of success, an inductive invariant I<sub>d</sub>. From the inductive invariant we compute a set of first-order formulae J, which becomes the new set of candidate lemmas. We will see later how to obtain this generalization. After computing the new lemmas, we set d = d + 1. If a concrete counterexample is found, there are two cases: (i) the counterexample falsifies the original property, and we exit the algorithm with a concrete counterexample; (ii) the counterexample falsifies some lemmas; in this case we remove those lemmas and restart the loop (without changing d).

#### **6.1 From Invariants to Universal Lemmas**

**Definition 3** *Let* d *be an integer, and let* I<sub>d</sub> *be a set of clauses containing* d *variables. A generalization of* I<sub>d</sub> *is a first-order formula* J *such that, when evaluating the quantifiers in* J *in a domain with precisely* d *elements, we obtain a formula equivalent to* I<sub>d</sub>*.*

We use the following technique for generalization. Suppose that I<sub>d</sub> is in CNF, and that we used c<sub>1</sub>,...,c<sub>d</sub> as variables for an instance with d elements. Then I<sub>d</sub> = C<sub>1</sub> ∧ ··· ∧ C<sub>n</sub> is a conjunction of clauses. From each of those clauses we

<sup>5</sup> Recall that we assume that quantified index variables are required to be different. Therefore, the property holds vacuously on instances of size smaller than the number of index variables in Φ.

will obtain a new candidate lemma. Let *AllDiff*(i) be the formula stating that all variables in i are pairwise different. Since every C<sub>d</sub> is given by a symmetric presentation [15], we have that, for every i ∈ {1,...,n}, C<sub>d</sub> |= ∀i<sub>1</sub>,...,i<sub>h</sub>. *AllDiff*(i<sub>1</sub>,...,i<sub>h</sub>) → C<sub>i</sub>(i<sub>1</sub>,...,i<sub>h</sub>), where the quantifiers range over c<sub>1</sub>,...,c<sub>d</sub> and h ≤ d is the number of variables occurring in C<sub>i</sub>. This means that J def= ⋀<sub>i</sub> ∀i. *AllDiff*(i) → C<sub>i</sub>(i) is a generalization of I<sub>d</sub>. In our algorithm, we add the set {∀i. C<sub>i</sub>(i)}<sup>n</sup><sub>i=1</sub> of new candidate lemmas to Ψ. Note that we omit the formula *AllDiff* thanks to our assumption that index variables take different values.
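The renaming behind this generalization can be sketched as follows; clauses are plain strings here, the function name is ours, and *AllDiff* is left implicit per the assumption that quantified indexes are pairwise distinct:

```python
import re

def clause_to_lemma(clause):
    """Generalize one clause of I_d: rename each instance constant ck to a
    bound variable ik and universally close over the variables that occur."""
    used = sorted(set(re.findall(r"c(\d+)", clause)), key=int)
    body = re.sub(r"c(\d+)", r"i\1", clause)
    return f"forall {', '.join('i' + k for k in used)} . {body}"

# one clause of a hypothetical inductive invariant of the size-2 instance:
lemma = clause_to_lemma("~(s[c1] = critical & s[c2] = critical)")
```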

**Fixing Unsound Lemmas** Unfortunately, we only know a priori that a lemma holds for the instance from which it was generalized. In general, its universal generalization, obtained as outlined above, might not hold in the system.

Suppose that the formula ψ<sub>1</sub> is a candidate lemma, obtained by generalization after the successful verification of an instance of size d. Suppose that later a counterexample for ψ<sub>1</sub> is found by exploring a different instance C<sub>d′</sub> (with d′ > d). This means that the lemma ψ<sub>1</sub> does not hold universally, but only for some finite instances of the system (including C<sub>d</sub>). In this case, we simply remove ψ<sub>1</sub> from the set of candidate lemmas Ψ, thus effectively weakening our working property (from Φ ∧ Ψ to Φ ∧ (Ψ \ {ψ<sub>1</sub>})). While this may cause a particular (abstract) counterexample to be encountered more than once during the main loop of the algorithm, since the finite instances are explored monotonically and their size d is increased after every successful verification of a bounded instance, the overall procedure still makes progress by exploring increasingly large instances of the system. The hope is that the algorithm will eventually discover enough good lemmas to block the abstract counterexample. This notion of (weak) progress is justified by the following:

**Proposition 4** *Let* π˜ *be an abstract counterexample,* Ψ *be the current set of universally quantified lemmas, and* d *be the size of the bounded instance to explore. During every execution of the algorithm, the same triple* (˜π, Ψ, d) *never occurs twice.*

#### **7 Related Work**

Parametric verification is a challenging problem, and there is a large body of work in the literature devoted to this problem. Here, we (necessarily) focus on the approaches that are most related to ours.

Several methods are based on quantifier elimination using decidable fragments of first-order logic, with notable examples in [7, 10, 22]. These methods guarantee a high degree of automation, but typically impose strong syntactic requirements on the input problem, and may suffer from scalability issues. A second popular approach is based on abstraction and abstraction refinement. Within this family of abstractions, earlier versions of the Parameter Abstraction [3, 15] have also been used successfully for industrial protocols [24]. The main drawback is that the degree of automation is limited, and substantial expertise is required to obtain the desired results. The first steps of our abstraction algorithm are inspired by those in [19] and [15]. The key difference from [19] is that in that work the abstract transition system C̃ is given by an eager propositional abstraction, with the axioms of the background theories recovered through schemata. Here we retain the theory of arrays in the abstract space C̃. Moreover, differently from both [15] and [19], our procedure includes an automatic refinement of the abstraction in a counterexample-driven manner.

Ivy [20, 22] implements both semi-automatic invariant checking with decidable logics (namely, Effectively Propositional Logic – EPR) and compositional abstraction with eager axioms [19]. MyPyvy [13,14] is a model checker inspired by the language of Ivy. It implements a version of IC3 capable of dealing with universal formulas [13]; the algorithm is completely automatic, but it is still based on quantifier elimination via reduction to decidable logics. In a more recent work, MyPyvy has gained the capability of inferring invariants with quantifier alternations, using a procedure that combines separators and first-order logic [14]. At the moment, our framework is capable of handling only universally quantified invariants. On the other hand, our approach is not limited to EPR, but it can in principle handle formulae with arbitrary SMT theories.

Exploring small instances of a parameterized system for candidate lemmas is a popular approach for parametric verification. In [8], this idea is used to over-approximate backward reachable states inside an algorithm which combines backward search and quantifier elimination. In [16], a finite-instance exploration is used together with a theorem prover to check the validity of candidate lemmas. In [17], candidate invariants are obtained from the set of reachable states of small instances. Similarly to our approach, these lemmas are used to strengthen an earlier version of the parameter abstraction. However, human intervention is still needed for the refinement.

A similar approach is presented in [23], where lemmas are obtained from a generalization of the proof of the property in a small instance of the protocol. The main difference with our technique, besides the methods used to extract such invariants, is the following: in [23], the authors show that to prove that a property (conjoined with lemmas) is inductive for all N, it is enough to prove that it is inductive for a particular N0, which is computable from the number of variables in the description of the system. This result is obtained from the imposed syntactic structure of the system. On the other hand, we impose less structure, and we rely on proving the property in an abstract version (and not a concrete instance) of the system. Moreover, our approach is integrated in an abstraction/refinement loop, which is missing from [23].

Another SMT-based approach for parametric verification is in [12]. The method is based on a reduction of invariant checking to the satisfiability of non-linear Constrained Horn Clauses (CHCs). Besides differing substantially in the overall approach, the method is more restrictive in the input language, and handles invariants only with a specific syntactic structure.

The use of prophecy variables for inferring universally quantified invariants has been explored also in non-parametric contexts, such as [18]. The main difference with our work is that [18] focuses on finding quantified invariants for quantifier-free transition systems with arrays, rather than array-based systems with quantifiers. The overall abstraction-refinement approach is also substantially different.

#### **8 Experimental Evaluation**

We have implemented our algorithm in a tool called Lambda (for **L**earning **A**bstractions fro**M B**ounde**D A**nalysis). Lambda is written in Python, and uses the SMT-based IC3 with implicit predicate abstraction of [4] as underlying quantifier-free verification engine.<sup>6</sup> Lambda accepts as input array-based systems specified either in the language of MCMT [11] or in VMT format (a light-weight extension of SMT-LIB to model transition systems [25]). In case of successful termination, Lambda generates either a counterexample trace (for violated properties) in a concrete instance of the parametric system, or a quantified inductive invariant that proves the property for any instance of the system. In the latter case, Lambda can also generate proof obligations that can be independently checked with an SMT solver supporting quantifiers, such as Z3 [21] or CVC4 [2]. More specifically, the quantified inductive invariant can be generated by Lambda by simply universally quantifying all the (index) variables in the inductive invariant generated for C˜, and conjoining it with the lemmas Ψ discovered during the main loop iterations. Computing such an invariant is immediate after the termination of the algorithm, and does not require additional reasoning.

In order to evaluate the effectiveness of our method, we have compared Lambda with two state-of-the-art tools for the verification of array-based systems, namely Cubicle [7] and MCMT. We could not include MyPyvy in the comparison, due to the many differences in input languages and modeling formalisms, which make an automatic translation of the benchmarks very difficult. We would also have liked to compare with the technique of [12]; however, the prototype tool mentioned in the paper does not appear to be available.

For our evaluation, we have collected a total of 116 benchmarks, divided into three different groups:

**Protocols** consists of 42 instances taken from the MCMT or Cubicle distributions, and used in previous works on verification of array-based systems. We have used all the instances which were available in both input formats, and we have split benchmarks containing multiple properties into different files.

**DynArch** consists of 57 instances of verification problems of dynamic architectures, taken from [6]. These benchmarks make use of arithmetic constraints on

<sup>6</sup> In our implementation, we use the theory of integers as an index theory. At first, this may seem odd, since we should consider all finite subsets of the integers. However, this is not a problem, since the satisfiability of a quantifier-free UFLIA formula is equivalent to its satisfiability in a finite index model.


**Table 1.** Summary of experimental results.

index terms, which are not supported by Cubicle. Therefore, we could only compare Lambda with MCMT on them.

**Trains** consists of 17 instances derived from (a simplified version of) verification problems on railway interlocking logics [1]. These benchmarks make use of several features that are not fully supported by Cubicle and MCMT (such as non-functional updates in the transition relation, transition rules with more than one universally quantified variable, and real-valued variables). None of these restrictions applies to Lambda, which in general accepts models with significantly fewer syntactic constraints than Cubicle and MCMT. Since these instances are inspired by relevant real-world verification problems, we believe it is interesting to include them in the evaluation even though we could only run Lambda on them.

Our implementation, all the benchmarks, and the scripts for reproducing the results are available at http://es.fbk.eu/people/griggio/papers/cade21-lambda.tar.gz. We have run our experiments on a cluster of machines with 2.90GHz Intel Xeon Gold 6226R CPUs running Ubuntu Linux 20.04.1, using a time limit of 1 hour and a memory limit of 4GB for each instance. We have used the default settings for MCMT, whereas for Cubicle we have also enabled the BRAB algorithm.<sup>7</sup> A summary of the results of our evaluation is presented in Table 1. More details are provided in our extended version [5].

Overall, Lambda is very competitive with the state of the art, and in fact it solves the largest number of instances (even when disregarding the Trains group, which cannot be handled by the other tools). When considering the Protocols group, Cubicle is often significantly faster than Lambda, especially on easier problems, thanks to its explicit-state exploration component (part of the BRAB algorithm). However, the symbolic techniques used by Lambda allow it to scale better to larger, more challenging problems: in the end, Lambda solves 4 more instances than Cubicle, and 10 more than MCMT. The situation is different for the DynArch group, in which Lambda and MCMT solve the same number of instances. However, it is interesting to observe that each tool can solve 5 instances that the other cannot; more generally, it seems that the two approaches have somewhat complementary strengths. Moreover, as already stated above, the fact that Lambda imposes significantly fewer syntactic restrictions than the other two tools allowed it to handle all the instances of the Trains group, which cannot be easily modeled in the languages of MCMT or Cubicle.

<sup>7</sup> The results reported were obtained using -brab 2; we have however experimented also with other (small) values for -brab, without noticing any significant difference.

Finally, we wish to remark that we have generated SMT proof obligations for checking the correctness of all the (universally quantified) inductive invariants produced by Lambda, and checked them with both CVC4 and Z3. None of the solvers reported any error, and overall the combination of the two solvers was able to successfully verify all the proof obligations for 65 of the 67 instances reported as safe.<sup>8</sup> We believe that the fact that we can easily produce proof obligations that can be independently checked is another strength of our approach. This is in contrast to the approach of Cubicle, where generating proof obligations is nontrivial [9].

#### **9 Conclusions**

In this paper we tackled the problem of universal invariant checking for parametric systems. We proposed a fully automated abstraction-refinement approach based on quantifier-free reasoning. The abstract model, which stutter-simulates the concrete model, is a quantifier-free symbolic transition system refined by (the instantiation of) candidate universal lemmas. These are obtained by analyzing the proofs of validity of the property in a finite instance of the parametric system. We experimentally evaluated an implementation on standard benchmarks from the literature. The results show the effectiveness of the method, also in comparison with state-of-the-art tools (Cubicle, MCMT). We are able to prove, in a fully automated manner and without manual intervention, several benchmarks that are considered challenging. In the future, we plan to work on generalization, to improve the ability to infer the right lemmas from a small instance, and to find more effective ways to filter out bad candidates. On the theoretical side, we will investigate the relation between the termination of the algorithm and decidable classes of parametric systems (e.g., those that enjoy a cut-off property). Finally, we will work on the verification of temporally extended properties that are also preserved by stuttering simulations (such as fragments of Linear Temporal Logic).

#### **References**


<sup>8</sup> In the remaining two cases, both solvers returned unknown when trying to prove the validity of some of the proof obligations.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Politeness and Stable Infiniteness: Stronger Together**

Ying Sheng1(B) , Yoni Zohar<sup>1</sup> , Christophe Ringeissen<sup>2</sup> , Andrew Reynolds<sup>3</sup> , Clark Barrett<sup>1</sup> , and Cesare Tinelli<sup>3</sup>

<sup>1</sup> Stanford University, Stanford, CA, USA <sup>2</sup> Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France <sup>3</sup> The University of Iowa, Iowa City, IA, USA

**Abstract.** We make two contributions to the study of polite combination in satisfiability modulo theories. The first is a separation between politeness and strong politeness, by presenting a polite theory that is not strongly polite. This result shows that proving strong politeness (which is often harder than proving politeness) is sometimes needed in order to use polite combination. The second contribution is an optimization to the polite combination method, obtained by borrowing from the Nelson-Oppen method. The Nelson-Oppen method is based on guessing arrangements over shared variables. In contrast, polite combination requires an arrangement over *all* variables of the shared sorts. We show that when using polite combination, if the other theory is stably infinite with respect to a shared sort, only the shared variables of that sort need be considered in arrangements, as in the Nelson-Oppen method. The time required to reason about arrangements is exponential in the worst case, so reducing the number of variables considered has the potential to improve performance significantly. We show preliminary evidence for this by demonstrating a speed-up on a smart contract verification benchmark.

## **1 Introduction**

Solvers for satisfiability modulo theories (SMT) [5] are used in a wide variety of applications. Many of these applications require determining the satisfiability of formulas with respect to a combination of background theories. In order to make reasoning about combinations of theories modular and easily extensible, a combination framework is essential. Combination frameworks provide mechanisms for automatically deriving a decision procedure for the combined theories by using the decision procedures for the individual theories as black boxes. To integrate a new theory into such a framework, it then suffices to focus on the decoupled decision procedure for the new theory alone, together with its interface to the generic combination framework.

In 1979, Nelson and Oppen [16] proposed a general framework for combining theories with disjoint signatures. In this framework, a quantifier-free formula in

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 148–165, 2021. https://doi.org/10.1007/978-3-030-79876-5_9

the combined theory is purified to a conjunction of formulas, one for each theory. Each pure formula is then sent to a dedicated theory solver, along with a guessed arrangement (a set of equalities and disequalities that capture an equivalence relation) of the variables shared among the pure formulas. For completeness [15], this method requires all component theories to be stably infinite. While many important theories are stably infinite, some are not, including the widely-used theory of fixed-length bit-vectors. To address this issue, the polite combination method was introduced by Ranise et al. [17], and later refined by Jovanovic and Barrett [12]. In polite combination, one theory must be polite, a stronger requirement than stable-infiniteness, but the requirement on the other theory is relaxed: specifically, it need not be stably infinite. The price for this generality is that unlike the Nelson-Oppen method, polite combination requires guessing arrangements over all variables of certain sorts, not just the shared ones. At a high level, polite theories have two properties: smoothness and finite witnessability (see Section 2). The polite combination theorem in [17] contained an error, which was identified in [12]. A fix was also proposed in [12], which relies on stronger requirements for finite witnessability. Following Casal and Rasga [8], we call this strengthened version strong finite witnessability. A theory that is both smooth and strongly finitely witnessable is called strongly polite.
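To see why restricting arrangements to shared variables matters, note that arrangements over a set of (same-sort) variables correspond to partitions of that set: equalities within a block, disequalities across blocks. Their number is the Bell number of the set's size, which grows super-exponentially. A small counting sketch:

```python
def partitions(variables):
    """Enumerate all partitions of a list of variables.  Each partition
    induces one arrangement, so the number of arrangements to guess is
    the Bell number of len(variables)."""
    if not variables:
        yield []
        return
    head, rest = variables[0], variables[1:]
    for part in partitions(rest):
        # place `head` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [part[i] + [head]] + part[i + 1:]
        # ... or into a new block of its own
        yield part + [[head]]

arrangements = list(partitions(["x", "y", "z"]))  # Bell(3) = 5 arrangements
```

Every variable that can be dropped from consideration reduces the count from Bell(n) toward Bell(k) for the k variables that remain, which is the exponential saving targeted by the optimization described below.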

This paper makes two contributions. First, we give an affirmative answer to the question of whether politeness and strong politeness are different notions, by giving an example of a theory that is polite but not strongly polite. The given theory is over an empty signature and has two sorts, and was originally studied in [8] in the context of shiny theories. Here we state and prove the separation of politeness and strong politeness without using shiny theories. Proving that a theory is strongly polite is harder than proving that it is just polite. This result shows that the additional effort is sometimes needed in order to be able to use the combination theorem from [12]. We show that for empty signatures, at least two sorts are needed to present a polite theory that is not strongly polite. However, for the empty signature with only one sort, there is a finitely witnessable theory that is not strongly finitely witnessable. Such a theory cannot be smooth.

Second, we explore different polite combination scenarios, where additional information is known about the theories being combined. In particular, we improve the polite combination method for the case where one theory is strongly polite w.r.t. a set S of sorts and the other is stably infinite w.r.t. a subset S′ ⊆ S of the sorts. For such cases, we show that it is possible to perform Nelson-Oppen combination for S′ and polite combination for S \ S′. This means that for the sorts in S′, only shared variables need to be considered for the guessed arrangement, which can considerably reduce its size. We also show that the set of shared variables can be reduced for a couple of other variations of conditions on the theories. Finally, we present a preliminary case study using a challenge benchmark from a smart contract verification application. We show that the reduction of shared variables is evident and significantly improves the solving time. Verification of smart contracts using SMT (and the analyzed benchmark in particular) is the main motivation behind the second contribution of this paper.

Related Work: Polite combination is part of a more general effort to replace the symmetric stable-infiniteness condition in the Nelson-Oppen approach with a weaker condition. Other examples of this effort include the notions of shiny [21], parametric [13], and gentle [11] theories. Gentle, shiny, and polite theories can be combined à la Nelson-Oppen with any arbitrary theory. Shiny theories were introduced by Tinelli and Zarba [21] as a class of mono-sorted theories. Based on the same principles as shininess, politeness is particularly well-suited to deal with theories expressed in many-sorted logic. Polite theories were introduced by Ranise et al. [17] to provide a more effective combination approach compared to parametric and shiny theories, the former requiring solvers to reason about cardinalities and the latter relying on expensive computations of minimal cardinalities of models. Shiny theories were extended to many-sorted signatures in [17], where there is a sufficient condition for their equivalence with polite theories. For the mono-sorted case, a sufficient condition for the equivalence of shiny theories and strongly polite theories was given by Casal and Rasga [7]. In later work [8], the same authors proposed a generalization of shiny theories to many-sorted signatures, different from the one in [17], and proved that it is equivalent to strongly polite theories with a decidable quantifier-free fragment. The strong politeness of the theory of algebraic datatypes [4] was proven in [18]. That paper also introduced additive witnesses, which provide a sufficient condition for a polite theory to also be strongly polite. In this paper we present a theory that is polite but not strongly polite. In accordance with [18], the witness that we provide for this theory is not additive.

The paper is organized as follows. Section 2 provides the necessary notions from first-order logic and polite theories. Section 3 discusses the difference between politeness and strong politeness and shows they are not equivalent. Section 4 gives the improvements for the combination process under certain conditions, and Section 5 demonstrates the effectiveness of these improvements for a challenge benchmark. <sup>4</sup>

#### **2 Preliminaries**

#### **2.1 Signatures and Structures**

We briefly review the usual definitions of many-sorted first-order logic with equality (see [10,19] for more details). A signature Σ consists of a set SΣ of sorts, a set FΣ of function symbols, and a set PΣ of predicate symbols. We assume SΣ, FΣ, and PΣ are countable. Function symbols have arities of the form σ1 × ... × σn → σ, and predicate symbols have arities of the form σ1 × ... × σn, with σ1,...,σn, σ ∈ SΣ. For each sort σ ∈ SΣ, PΣ includes an equality symbol =σ of arity σ × σ; we denote it by = when σ is clear from context. When the symbols =σ are the only symbols in Σ, we say that Σ is empty. If two signatures share no symbols except the =σ, we call them disjoint. We assume an underlying countably

<sup>4</sup> Due to space constraints, some proofs are omitted. They can be found in an extended version at https://arxiv.org/abs/2104.11738.

infinite set of variables for each sort. Terms, formulas, and literals are defined in the usual way. For a Σ-formula φ and a sort σ, we denote the set of free variables in φ of sort σ by varsσ(φ). This notation naturally extends to varsS(φ) when S is a set of sorts; vars(φ) is the set of all free variables in φ. We denote by QF(Σ) the set of quantifier-free Σ-formulas.

A Σ-structure is a many-sorted structure that provides semantics for the symbols in Σ (but not for variables). It consists of a domain σ<sup>A</sup> for each sort σ ∈ SΣ, an interpretation f<sup>A</sup> for every f ∈ FΣ, and an interpretation P<sup>A</sup> for every P ∈ PΣ. We further require that =σ be interpreted as the identity relation over σ<sup>A</sup> for every σ ∈ SΣ. A Σ-interpretation A is an extension of a Σ-structure with interpretations for some set of variables. For any Σ-term α, α<sup>A</sup> denotes the interpretation of α in A. When α is a set of Σ-terms, α<sup>A</sup> = {x<sup>A</sup> | x ∈ α}. Satisfaction is defined as usual; A |= ϕ denotes that A satisfies ϕ.

A Σ-theory T is the class of all Σ-structures that satisfy some set Ax of Σ-sentences; for each such set Ax, we say that T is axiomatized by Ax. A Σ-interpretation whose underlying Σ-structure is in T is called a T-interpretation. A Σ-formula φ is T-satisfiable if A |= φ for some T-interpretation A. A set Φ of Σ-formulas is T-satisfiable if some T-interpretation satisfies every φ ∈ Φ. Two formulas φ and ψ are T-equivalent if they are satisfied by the same T-interpretations.

Note that for any class <sup>C</sup> of <sup>Σ</sup>-structures there is a theory <sup>T</sup><sup>C</sup> that corresponds to it, with the same satisfiable formulas: the Σ-theory axiomatized by the set Ax of <sup>Σ</sup>-sentences that are satisfied in every structure of <sup>C</sup>. In the examples that follow, we define theories T<sup>C</sup> implicitly by specifying only the class C, as done in the SMT-LIB 2 standard [2]. This can be done without loss of generality.

Example 1. Let ΣList be a signature of finite lists containing the sorts elem1, elem2, and list, as well as the function symbols cons of arity elem1×elem2×list → list, car<sup>1</sup> of arity list → elem1, car<sup>2</sup> of arity list → elem2, cdr of arity list → list, and nil of arity list. The <sup>Σ</sup>List-theory <sup>T</sup>List corresponds to an SMT-LIB 2 theory of algebraic datatypes [2,4], where elem<sup>1</sup> and elem<sup>2</sup> are interpreted as some sets (of "elements"), and list is interpreted as finite lists of pairs of elements, one from elem<sup>1</sup> and the other from elem2. cons is a list constructor that takes two elements and a list, and inserts the two elements at the head of the list. The pair (car1(l), car2(l)) is the first entry in l, and cdr(l) is the list obtained from l by removing its first entry. nil is the empty list.
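To make the intended semantics concrete, here is a minimal Python sketch of the TList operations, modeling a list value as a tuple of (elem1, elem2) pairs. The encoding is ours, not part of the paper, and selectors applied to nil are left unspecified, as their treatment depends on the datatype theory.

```python
# A list value is a tuple of (elem1, elem2) pairs; nil is the empty tuple.
nil = ()

def cons(e1, e2, lst):
    """Insert the pair (e1, e2) at the head of lst."""
    return ((e1, e2),) + lst

def car1(lst):
    """First component of the head pair (undefined on nil)."""
    return lst[0][0]

def car2(lst):
    """Second component of the head pair (undefined on nil)."""
    return lst[0][1]

def cdr(lst):
    """The list without its head pair (undefined on nil)."""
    return lst[1:]
```

For instance, `car1(cons(7, "b", nil))` is `7`, and `cdr(cons(7, "b", nil))` is `nil`.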

Example 2. The signature ΣInt includes a single sort int, all numerals 0, 1,..., the function symbols +, − and · of arity int × int → int, and the predicate symbols < and ≤ of arity int × int. The ΣInt-theory TInt corresponds to integer arithmetic in SMT-LIB 2, and the interpretation of the symbols is the same as in the standard structure of the integers. The signature ΣBV4 includes a single sort BV4 and various function and predicate symbols for reasoning about bit-vectors of length 4 (such as & for bit-wise and, constants of the form 0110, etc.). The ΣBV4-theory TBV4 corresponds to SMT-LIB 2 bit-vectors of size 4, with the expected semantics of constants and operators.

Let Σ1, Σ2 be signatures, T1 a Σ1-theory, and T2 a Σ2-theory. The combination of T1 and T2, denoted T1 ⊕ T2, consists of all Σ1 ∪ Σ2-structures A such that the reduct of A to Σi is in Ti for i ∈ {1, 2}.

Example 3. Let TIntBV4 be TInt ⊕ TBV4. It is the combined theory of integers and bit-vectors. It has all the sorts and operators from both theories. If we rename the sorts elem<sup>1</sup> and elem<sup>2</sup> of ΣList to int and BV4, respectively, we can obtain a theory TListIntBV4 defined as TIntBV4 ⊕ TList. This is the theory of lists of pairs, where each pair consists of an integer and a bit-vector of size 4.

The following definitions and theorems will be useful in the sequel.

**Theorem 1 (Theorem 9 of [19]).** Let Σ be a signature, and A a satisfiable set of Σ-formulas. Then there exists an interpretation A that satisfies A, in which σ<sup>A</sup> is countable whenever it is infinite.<sup>5</sup>

**Definition 1 (Arrangement).** Let V be a finite set of variables whose sorts are in <sup>S</sup> and let {V<sup>σ</sup> <sup>|</sup> <sup>σ</sup> <sup>∈</sup> <sup>S</sup>} be a partition of <sup>V</sup> such that <sup>V</sup><sup>σ</sup> is the set of variables of sort σ in V . A formula δ is an arrangement of V if

$$\delta = \bigwedge\_{\sigma \in S} (\bigwedge\_{(x,y) \in E\_{\sigma}} (x = y) \land \bigwedge\_{x,y \in V\_{\sigma}, (x,y) \notin E\_{\sigma}} (x \neq y)) \,,$$

where <sup>E</sup><sup>σ</sup> is some equivalence relation over <sup>V</sup><sup>σ</sup> for each <sup>σ</sup> <sup>∈</sup> <sup>S</sup>.
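As an illustration, the arrangements of Definition 1 can be enumerated mechanically: pick one equivalence relation (i.e., one set partition) per sort and emit the induced equalities and disequalities. A small Python sketch, with function names of our choosing:

```python
from itertools import product

def partitions(xs):
    """Yield all set partitions of the list xs (each partition is a list of blocks)."""
    if not xs:
        yield []
        return
    first, rest = xs[0], xs[1:]
    for part in partitions(rest):
        # put `first` into each existing block, or into a new singleton block
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def arrangement(blocks_by_sort):
    """Render the arrangement delta induced by one partition per sort."""
    lits = []
    for blocks in blocks_by_sort:
        for b in blocks:                      # equalities inside each block
            lits += [f"{x} = {y}" for x, y in zip(b, b[1:])]
        for i, b in enumerate(blocks):        # disequalities across blocks
            for c in blocks[i + 1:]:
                lits += [f"{x} != {y}" for x in b for y in c]
    return " & ".join(lits) if lits else "true"

def arrangements(vars_by_sort):
    """Yield every arrangement of V, one equivalence relation E_sigma per sort."""
    per_sort = [list(partitions(vs)) for vs in vars_by_sort.values()]
    for choice in product(*per_sort):
        yield arrangement(choice)
```

For example, with Vσ1 = {x, y} and Vσ2 = {u}, there are exactly two arrangements: one with x = y and one with x ≠ y.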

The following theorem from [12] is a variant of a theorem from [20].

**Theorem 2 (Theorem 2.5 of [12]).** For i = 1, 2, let Σi be disjoint signatures, Si = SΣi with S = S1 ∩ S2, Ti be a Σi-theory, Γi be a set of Σi-literals, and V = vars(Γ1) ∩ vars(Γ2). If there exist a T1-interpretation A, a T2-interpretation B, and an arrangement δV of V such that: 1. A |= Γ1 ∪ δV; 2. B |= Γ2 ∪ δV; and 3. |σ<sup>A</sup>| = |σ<sup>B</sup>| for every σ ∈ S, then Γ1 ∪ Γ2 is T1 ⊕ T2-satisfiable.

#### **2.2 Polite Theories**

We now give the background definitions necessary for both Nelson-Oppen and polite combination. In what follows, Σ is an arbitrary (many-sorted) signature, <sup>S</sup> ⊆ SΣ, and <sup>T</sup> is a <sup>Σ</sup>-theory. We start with stable infiniteness and smoothness.

**Definition 2 (Stably Infinite).** T is stably infinite with respect to S if every quantifier-free Σ-formula that is T-satisfiable is also satisfiable in a T-interpretation A in which σ<sup>A</sup> is infinite for every σ ∈ S.

**Definition 3 (Smooth).** T is smooth w.r.t. S if for every quantifier-free formula φ, T-interpretation A that satisfies φ, and function κ from S to the class of cardinals such that κ(σ) ≥ |σ<sup>A</sup>| for every σ ∈ S, there exists a T-interpretation A′ that satisfies φ with |σ<sup>A′</sup>| = κ(σ) for every σ ∈ S.

<sup>5</sup> In [19] this was proven more generally, for ordered sorted logics.

We identify singleton sets with their single elements when there is no ambiguity (e.g., when saying that a theory is smooth w.r.t. a sort σ).

We next define politeness and related concepts, following the presentation in [18]. Let φ be a quantifier-free Σ-formula. A Σ-interpretation A finitely witnesses φ for T w.r.t. S (or, is a finite witness of φ for T w.r.t. S) if A |= φ and σ<sup>A</sup> = varsσ(φ)<sup>A</sup> for every σ ∈ S. We say that φ is finitely witnessed for T w.r.t. S if it is either T-unsatisfiable or has a finite witness for T w.r.t. S. We say that φ is strongly finitely witnessed for T w.r.t. S if φ ∧ δV is finitely witnessed for T w.r.t. S for every arrangement δV of V, where V is any set of variables whose sorts are in S. A function wit : QF(Σ) → QF(Σ) is a (strong) witness for T w.r.t. S if for every φ ∈ QF(Σ): 1. φ and ∃w⃗. wit(φ) are T-equivalent, where w⃗ = vars(wit(φ)) \ vars(φ); and 2. wit(φ) is (strongly) finitely witnessed for T w.r.t. S. T is (strongly) finitely witnessable w.r.t. S if there exists a computable (strong) witness for T w.r.t. S. T is (strongly) polite w.r.t. S if it is smooth and (strongly) finitely witnessable w.r.t. S.

# **3 Politeness and Strong Politeness**

In this section, we study the difference between politeness and strong politeness. Since the introduction of strong politeness in [12], it has been unclear whether it is strictly stronger than politeness, that is, whether there exists a theory that is polite but not strongly polite. We present an example of such a theory, answering the open question affirmatively. This result is followed by further analysis of notions related to politeness. This section is organized as follows. In Section 3.1 we reformulate an example given in [12], showing that there are witnesses that are not strong witnesses. We then present a polite theory that is not strongly polite in Section 3.2. The theory is over a signature with two sorts that is otherwise empty. We show in Section 3.3 that politeness and strong politeness are equivalent for empty signatures with a single sort. Finally, we show in Section 3.4 that this equivalence does not hold for finite witnessability alone.

#### **3.1 Witnesses vs. Strong Witnesses**

In [12], an example was given of a witness that is not strong. We reformulate this example in terms of the notions defined in the current paper: finitely witnessed formulas are not the same as strongly finitely witnessed formulas (Example 4), and witnesses are not the same as strong witnesses (Example 5).

Example 4. Let Σ<sup>0</sup> be a signature with a single sort σ and no function or predicate symbols, and let <sup>T</sup><sup>0</sup> be a <sup>Σ</sup>0-theory consisting of all <sup>Σ</sup>0-structures with at least two elements. Let <sup>φ</sup> be the formula <sup>x</sup> <sup>=</sup> <sup>x</sup> <sup>∧</sup> <sup>w</sup> <sup>=</sup> <sup>w</sup>. This formula is finitely witnessed for <sup>T</sup><sup>0</sup> w.r.t. <sup>σ</sup>, but not strongly. Indeed, for <sup>δ</sup><sup>V</sup> <sup>≡</sup> (<sup>x</sup> <sup>=</sup> <sup>w</sup>), <sup>φ</sup> <sup>∧</sup> <sup>δ</sup><sup>V</sup> is not finitely witnessed for <sup>T</sup><sup>0</sup> w.r.t. <sup>σ</sup>: a finite witness would be required to have only a single element and would therefore not be a T0-interpretation.

The next example shows that witnesses and strong witnesses are not equivalent.

Example 5. Take Σ0, σ, and T0 as in Example 4, and define wit as the function mapping φ to (φ ∧ w1 = w1 ∧ w2 = w2) for fresh w1, w2. This function is a witness for T0 w.r.t. σ. However, it is not a strong witness for T0 w.r.t. σ.

Although the theory T<sup>0</sup> in the above examples does serve to distinguish formulas and witnesses that are and are not strong, it cannot be used to do the same for theories themselves. This is because T<sup>0</sup> is, in fact, strongly polite, via a different witness function.

Example 6. The function wit′(φ) = (φ ∧ w1 = w2), for some w1, w2 ∉ varsσ(φ), is a strong witness for T0 w.r.t. σ, as proved in [12].

A natural question, then, is whether there is a theory that can separate the two notions of politeness. The following subsection provides an affirmative answer.

#### **3.2 A Polite Theory that is not Strongly Polite**

Let Σ2 be a signature with two sorts σ1 and σ2 and no function or predicate symbols (except =). Let T<sup>2</sup>,<sup>3</sup> be the Σ2-theory from [8], consisting of all Σ2-structures A such that either |σ1<sup>A</sup>| = 2 and |σ2<sup>A</sup>| ≥ ℵ0, or |σ1<sup>A</sup>| ≥ 3 and |σ2<sup>A</sup>| ≥ 3.<sup>6</sup>

T<sup>2</sup>,<sup>3</sup> is polite, but is not strongly polite. Its smoothness is shown by extending any given structure with new elements as much as necessary.

**Lemma 1.** <sup>T</sup><sup>2</sup>,<sup>3</sup> is smooth w.r.t. {σ1, σ2}.

For finite witnessability, consider the function wit defined as follows:

$$\text{wit}(\phi) := \phi \land x\_1 = x\_1 \land x\_2 = x\_2 \land x\_3 = x\_3 \land y\_1 = y\_1 \land y\_2 = y\_2 \land y\_3 = y\_3 \quad \text{(1)}$$

for fresh variables x1, x2, and x<sup>3</sup> of sort σ<sup>1</sup> and y1, y2, and y<sup>3</sup> of sort σ2. It can be shown that wit is a witness for T<sup>2</sup>,<sup>3</sup> but there is no strong witness for it.

**Lemma 2.** <sup>T</sup><sup>2</sup>,<sup>3</sup> is finitely witnessable w.r.t. {σ1, σ2}.

**Lemma 3.** T<sup>2</sup>,<sup>3</sup> is not strongly finitely witnessable w.r.t. {σ1, σ2}.

Lemmas 1 to 3 have shown that T<sup>2</sup>,<sup>3</sup> is polite but not strongly polite. And indeed, using the polite combination method from [12] with this theory can cause problems. Consider the theory T<sup>1</sup>,<sup>1</sup> that consists of all Σ2-structures A such that |σ1<sup>A</sup>| = |σ2<sup>A</sup>| = 1. Clearly, T<sup>1</sup>,<sup>1</sup> ⊕ T<sup>2</sup>,<sup>3</sup> is empty, and hence no formula is T<sup>1</sup>,<sup>1</sup> ⊕ T<sup>2</sup>,<sup>3</sup>-satisfiable. However, denote the formula true by Γ1 and the formula x = x by Γ2 for some variable x of sort σ1. Then wit(Γ2) is x = x ∧ ⋀<sub>i=1..3</sub> (xi = xi ∧ yi = yi). Let δ be the arrangement x = x1 = x2 = x3 ∧ y1 = y2 = y3. It can be shown that wit(Γ2) ∧ δ is T<sup>2</sup>,<sup>3</sup>-satisfiable and Γ1 ∧ δ is T<sup>1</sup>,<sup>1</sup>-satisfiable. Hence the combination method of [12] would consider Γ1 ∧ Γ2 to be T<sup>1</sup>,<sup>1</sup> ⊕ T<sup>2</sup>,<sup>3</sup>-satisfiable, which is impossible. Hence the fact that T<sup>2</sup>,<sup>3</sup> is not strongly polite propagates all the way to the polite combination method.<sup>7</sup>

<sup>6</sup> In [8], the first condition is written |σ1<sup>A</sup>| ≥ 2. We use equality, as this is equivalent and we believe it makes things clearer.

<sup>7</sup> Notice that T<sup>2</sup>,<sup>3</sup> can be axiomatized, given the definitions in Figure 1, by the set of axioms {ψ<sup>σ1</sup><sub>≥2</sub>, ψ<sup>σ2</sup><sub>≥3</sub>} ∪ {ψ<sup>σ1</sup><sub>=2</sub> → ¬ψ<sup>σ2</sup><sub>=n</sub> | n ≥ 3}.

$$\begin{aligned} distinct(x\_1, \ldots, x\_n) &:= \bigwedge\_{1 \le i < j \le n} x\_i \neq x\_j \\ \psi\_{\ge n}^{\sigma} &:= \exists x\_1, \ldots, x\_n. distinct(x\_1, \ldots, x\_n) \\ \psi\_{\le n}^{\sigma} &:= \exists x\_1, \ldots, x\_n. \forall y. \bigvee\_{i=1}^n y = x\_i \\ \psi\_{= n}^{\sigma} &:= \psi\_{\ge n}^{\sigma} \land \psi\_{\le n}^{\sigma} \end{aligned}$$

**Fig. 1.** Cardinality formulas for sort <sup>σ</sup>. All variables are assumed to have sort <sup>σ</sup>.
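The cardinality formulas of Figure 1 can be generated mechanically for any n. A small Python sketch that renders them as strings; the concrete rendering conventions (variable names, connective symbols) are ours:

```python
def distinct(xs):
    """Pairwise disequalities over the variables xs."""
    return " & ".join(f"{x} != {y}" for i, x in enumerate(xs) for y in xs[i + 1:])

def psi_ge(n):
    """psi_{>=n}: there exist n pairwise distinct elements."""
    xs = [f"x{i}" for i in range(1, n + 1)]
    return f"exists {', '.join(xs)}. {distinct(xs)}"

def psi_le(n):
    """psi_{<=n}: every y equals one of x1..xn."""
    xs = [f"x{i}" for i in range(1, n + 1)]
    eqs = " | ".join(f"y = {x}" for x in xs)
    return f"exists {', '.join(xs)}. forall y. {eqs}"

def psi_eq(n):
    """psi_{=n}: exactly n elements."""
    return f"({psi_ge(n)}) & ({psi_le(n)})"
```

For instance, `psi_ge(2)` renders as `exists x1, x2. x1 != x2`.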

Remark 1. An alternative way to separate politeness from strong politeness using T2,3 can be obtained through shiny theories, as follows. Shiny theories were introduced in [21] for the mono-sorted case, and were generalized to many-sorted signatures in two different ways in [8] and [17]. In [8], T2,3 was introduced as a theory that is shiny according to [17], but not according to [8]. Theorem 1 of [8] states that their notion of shininess is equivalent to strong politeness for theories in which the satisfiability problem for quantifier-free formulas is decidable. Since this is the case for T2,3, and since it is not shiny according to [8], we conclude that T2,3 is not strongly polite. Further, Proposition 18 of [17] states that every shiny theory (according to their definition) is polite. Hence we get that T2,3 is polite but not strongly polite.

We have (and prefer) a direct proof based only on politeness, without a detour through shininess. Note also that [8] dealt only with strongly polite theories and did not study the weaker notion of polite theories. In particular, the fact that strong politeness differs from politeness was neither stated nor proved there.

#### **3.3 The Case of Mono-sorted Polite Theories**

Theory T<sup>2</sup>,<sup>3</sup> includes two sorts but is otherwise empty. In this section, we show that having two sorts is essential for separating politeness from strong politeness in otherwise empty signatures. That is, we prove that with a single sort, politeness implies strong politeness. Let Σ0 be the signature with a single sort σ and no function or predicate symbols (except =). We show that smooth Σ0-theories have a certain form and conclude strong politeness from politeness.

**Lemma 4.** Let T be a Σ0-theory. If T is smooth w.r.t. σ and includes a finite structure, then T is axiomatized by ψ<sup>σ</sup><sub>≥n</sub> from Figure 1 for some n > 0.

**Proposition 1.** If <sup>T</sup> is a <sup>Σ</sup>0-theory that is polite w.r.t. <sup>σ</sup>, then it is strongly polite w.r.t. σ.

Remark 2. We again note (as we did in Remark 1) that an alternative way to obtain this result is via shiny theories, using [17], which introduced polite theories, as well as [7], which compared strongly polite theories to shiny theories

in the mono-sorted case. Specifically, in the presence of a single sort, Proposition 19 of [17] states that:

(∗) if the question of whether a polite theory over a finite signature contains a finite structure is decidable, the theory is shiny.

In turn, Proposition 1 of [7] states that:

(∗∗) every shiny theory over a mono-sorted signature with a decidable satisfiability problem for quantifier-free formulas is also strongly polite.

It can be shown that the question of whether a polite Σ0-theory contains a finite structure is decidable. It can also be shown that satisfiability of quantifier-free formulas is decidable for such theories. Using (∗) and (∗∗), we get that for Σ0-theories, politeness implies strong politeness. As above (Remark 1), we prefer a direct route for showing this result, without going through shiny theories.

#### **3.4 Mono-sorted Finite Witnessability**

We have seen that for Σ0-theories, politeness and strong politeness are the same. Now we show that smoothness is crucial for this equivalence, i.e., that there is no such equivalence between finite witnessability and strong finite witnessability. Let T<sup>∞</sup><sub>Even</sub> be the Σ0-theory of all Σ0-structures A such that |σ<sup>A</sup>| is even or infinite.<sup>8</sup> Clearly, this theory is not smooth.

**Lemma 5.** T<sup>∞</sup><sub>Even</sub> is not smooth w.r.t. σ.

We can construct a witness wit for T<sup>∞</sup><sub>Even</sub> as follows. Let φ be a quantifier-free Σ0-formula, and let E be the set of all equivalence relations over vars(φ) ∪ {w} for some fresh variable w. Let even(E) be the set of all equivalence relations in E with an even number of equivalence classes. Then wit(φ) is φ ∧ ⋁<sub>e∈even(E)</sub> δe, where for each e ∈ even(E), δe is the arrangement induced by e:

$$\bigwedge\_{(x,y)\in e} x = y \land \bigwedge\_{x,y \in vars \,(\phi)\cup\{w\}\land(x,y)\notin e} x \neq y$$

It can be shown that wit is indeed a witness, and that T<sup>∞</sup><sub>Even</sub> has no strong witness, with a proof similar to that of Lemma 3.
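To illustrate, the witness described above can be computed directly: enumerate the equivalence relations over vars(φ) ∪ {w}, keep those with an even number of classes, and disjoin the induced arrangements. A Python sketch (function names are ours; formulas are rendered as strings):

```python
from itertools import combinations

def partitions(xs):
    """All set partitions of xs, as lists of blocks."""
    if not xs:
        return [[]]
    out = []
    first, rest = xs[0], xs[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            out.append(part[:i] + [[first] + part[i]] + part[i + 1:])
        out.append([[first]] + part)
    return out

def delta(blocks):
    """Arrangement induced by one equivalence relation (partition)."""
    lits = []
    for b in blocks:                          # equalities inside each class
        lits += [f"{x} = {y}" for x, y in zip(b, b[1:])]
    for b, c in combinations(blocks, 2):      # disequalities across classes
        lits += [f"{x} != {y}" for x in b for y in c]
    return "(" + " & ".join(lits) + ")" if lits else "true"

def wit_even(phi, phi_vars):
    """wit(phi): phi conjoined with the disjunction of all arrangements of
    vars(phi) + {w} that have an even number of equivalence classes."""
    vs = phi_vars + ["w"]                     # w plays the role of the fresh variable
    evens = [p for p in partitions(vs) if len(p) % 2 == 0]
    return f"{phi} & (" + " | ".join(delta(p) for p in evens) + ")"
```

With two free variables, vars(φ) ∪ {w} has 3 elements, hence Bell(3) = 5 equivalence relations, of which 3 have an even (namely 2) number of classes.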

**Lemma 6.** T<sup>∞</sup><sub>Even</sub> is finitely witnessable w.r.t. σ.

**Lemma 7.** T<sup>∞</sup><sub>Even</sub> is not strongly finitely witnessable w.r.t. σ.

#### **4 A Blend of Polite and Stably-Infinite Theories**

In this section, we show that the polite combination method can be optimized to reduce the search space of possible arrangements. In what follows, Σ<sup>1</sup> and Σ<sup>2</sup> are disjoint signatures, <sup>S</sup> <sup>=</sup> <sup>S</sup><sup>Σ</sup><sup>1</sup> ∩ S<sup>Σ</sup><sup>2</sup> , <sup>T</sup><sup>1</sup> is a <sup>Σ</sup>1-theory, <sup>T</sup><sup>2</sup> is a <sup>Σ</sup>2-theory, <sup>Γ</sup><sup>1</sup> is a set of Σ1-literals, and Γ<sup>2</sup> is a set of Σ2-literals.

<sup>8</sup> Notice that T<sup>∞</sup><sub>Even</sub> can be axiomatized using the set {¬ψ<sup>σ</sup><sub>=2n+1</sub> | n ∈ ℕ}.

The Nelson-Oppen procedure reduces the T1 ⊕ T2-satisfiability of Γ1 ∪ Γ2 to the existence of an arrangement δ over the set V = varsS(Γ1) ∩ varsS(Γ2), such that Γ1 ∪ δ is T1-satisfiable and Γ2 ∪ δ is T2-satisfiable. The correctness of this reduction relies on the fact that both theories are stably infinite w.r.t. S. In contrast, the polite combination method only requires a condition (namely strong politeness) from one of the theories, while the other theory is unrestricted and, in particular, not necessarily stably infinite. In polite combination, the T1 ⊕ T2-satisfiability of Γ1 ∪ Γ2 is again reduced to the existence of an arrangement δ, but over a different set V′ = varsS(wit(Γ2)), such that Γ1 ∪ δ is T1-satisfiable and wit(Γ2) ∪ δ is T2-satisfiable, where wit is a strong witness for T2 w.r.t. S. Thus, the flexibility offered by polite combination comes at a price: the set V′ is potentially larger than V, as it contains all variables with sorts in S that occur in wit(Γ2), not just those that also occur in Γ1. Since the search space of arrangements over a set grows exponentially with its size, this difference can become crucial. If T1 happens to be stably infinite w.r.t. S, however, we can fall back to Nelson-Oppen combination and only consider variables that are shared by the two sets. But what if T1 is stably infinite only w.r.t. some proper subset S′ ⊂ S? Can this knowledge about T1 help in finding some set V′′ of variables between V and V′, such that we need only consider arrangements of V′′? In this section we prove that this is possible by taking V′′ to include only the variables of sorts in S′ that are shared between Γ1 and wit(Γ2), together with all the variables of sorts in S \ S′ that occur in wit(Γ2). We also identify several weaker conditions on T2 that are sufficient for the combination theorem to hold.
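To see the stakes concretely: for a single sort, the number of arrangements of n variables equals the number of equivalence relations on n elements, the Bell number B(n). A quick Python sketch using the Bell-triangle recurrence (the example sizes in the comment are hypothetical):

```python
def bell(n):
    """Bell number B(n): the number of arrangements of n variables of one sort."""
    row = [1]                         # Bell triangle; row starts as [B(0)]
    for _ in range(n):
        nxt = [row[-1]]               # next row begins with the previous row's last entry
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[0]

# e.g. dropping unshared witness variables so a sort's variable set
# shrinks from 10 to 4 reduces the arrangement search space
# from B(10) = 115975 to B(4) = 15.
```

Multiple sorts multiply these counts, so removing even a few variables per sort can shrink the guessing space by orders of magnitude.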

#### **4.1 Refined Combination Theorem**

To put the discussion above in formal terms, we recall the following theorem.

**Theorem 3 ([12]).** If T2 is strongly polite w.r.t. S with a witness wit, then the following are equivalent: 1. Γ1 ∪ Γ2 is (T1 ⊕ T2)-satisfiable; 2. there exists an arrangement δV over V, such that Γ1 ∪ δV is T1-satisfiable and wit(Γ2) ∪ δV is T2-satisfiable, where V = ⋃<sub>σ∈S</sub> Vσ and Vσ = varsσ(wit(Γ2)) for each σ ∈ S.

Our goal is to identify general cases in which information regarding T<sup>1</sup> can help reduce the size of the set V . We extend the definitions of stably infinite, smooth, and strongly finitely witnessable to two sets of sorts rather than one. Roughly speaking, in this extension, the usual definition is taken for the first set, and some cardinality-preserving constraints are enforced on the second set.

**Definition 4.** Let <sup>Σ</sup> be a signature, <sup>S</sup>1, S<sup>2</sup> two disjoint subsets of <sup>S</sup>Σ, and <sup>T</sup> <sup>a</sup> Σ-theory.

T is (strongly) stably infinite w.r.t. (S1, S2) if for every quantifier-free Σ-formula φ and T-interpretation A satisfying φ, there exists a T-interpretation B such that B |= φ, |σ<sup>B</sup>| is infinite for every σ ∈ S1, and |σ<sup>B</sup>| ≤ |σ<sup>A</sup>| (respectively, |σ<sup>B</sup>| = |σ<sup>A</sup>|) for every σ ∈ S2.

T is smooth w.r.t. (S1, S2) if for every quantifier-free Σ-formula φ, T-interpretation A satisfying φ, and function κ from S1 to the class of cardinals such that κ(σ) ≥ |σ<sup>A</sup>| for each σ ∈ S1, there exists a T-interpretation B that satisfies φ, with |σ<sup>B</sup>| = κ(σ) for each σ ∈ S1, and with σ<sup>B</sup> infinite whenever σ<sup>A</sup> is infinite, for each σ ∈ S2.

T is strongly finitely witnessable w.r.t. (S1, S2) if there exists a computable function wit : QF(Σ) → QF(Σ) such that for every quantifier-free Σ-formula φ: 1. φ and ∃w⃗. wit(φ) are T-equivalent, where w⃗ = vars(wit(φ)) \ vars(φ); and 2. for every T-interpretation A and arrangement δ of any set of variables whose sorts are in S1, if A satisfies wit(φ) ∧ δ, then there exists a T-interpretation B that finitely witnesses wit(φ) ∧ δ w.r.t. S1 and for which σ<sup>B</sup> is infinite whenever σ<sup>A</sup> is infinite, for each σ ∈ S2.

Our main result is the following.

**Theorem 4.** Let Ssi ⊆ S and Snsi = S \ Ssi. Suppose T1 is stably infinite w.r.t. Ssi and one of the following holds:

1. T2 is strongly finitely witnessable w.r.t. Snsi with witness wit, smooth w.r.t. Snsi, and strongly stably infinite w.r.t. (Ssi, Snsi);
2. T2 is strongly finitely witnessable w.r.t. Snsi with witness wit, smooth w.r.t. (Snsi, Ssi), and stably infinite w.r.t. (Ssi, Snsi);
3. T2 is strongly finitely witnessable w.r.t. (Snsi, Ssi) with witness wit, smooth w.r.t. (Snsi, Ssi), and stably infinite w.r.t. Ssi.

Then the following are equivalent: 1. Γ1 ∪ Γ2 is (T1 ⊕ T2)-satisfiable; 2. there exists an arrangement δV over V such that Γ1 ∪ δV is T1-satisfiable and wit(Γ2) ∪ δV is T2-satisfiable, where V = ⋃<sub>σ∈S</sub> Vσ, with Vσ = varsσ(wit(Γ2)) for every σ ∈ Snsi and Vσ = varsσ(Γ1) ∩ varsσ(wit(Γ2)) for every σ ∈ Ssi.

All three items of Theorem 4 include assumptions that guarantee that the two theories agree on cardinalities of shared sorts. For example, in the first item, we first shrink the Snsi-domains of the T2-model using strong finite witnessability, and then expand them using smoothness. But then, to obtain infinite domains for the Ssi sorts, stable infiniteness is not enough, as we need to maintain the cardinalities of the Snsi domains while making the domains of the Ssi sorts infinite. For this, the stronger property of strong stable infiniteness is used.

The formal proof of this theorem is provided in Section 4.2, below. Figure 2 is a visualization of the claims in Theorem 4. The theorem considers two variants of strong finite witnessability, two variants of smoothness, and three variants of stable infiniteness. For each of the three cases of Theorem 4, Figure 2 shows which variant of each property is assumed. The height of each bar corresponds to the strength of the property. In the first case, we use ordinary strong finite witnessability and smoothness, but the strongest variant of stable infiniteness; in the second, we use ordinary strong finite witnessability with the new variants of stable infiniteness and smoothness; and in the third, we use ordinary stable infiniteness and the stronger variants of strong finite witnessability and smoothness. The order of the bars corresponds to the order of their usage in the proof of each case. The stage at which stable infiniteness is used determines the required strength of the other properties: whatever is used before it is taken in ordinary form, and whatever is used after it requires a stronger form.


Going back to the standard definitions of stable infiniteness, smoothness, and strong finite witnessability, we get the following corollary by using case 1 of the theorem and noticing that smoothness w.r.t. S implies strong stable infiniteness w.r.t. any partition of S.

**Corollary 1.** Let Ssi ⊆ S and Snsi = S \ Ssi. Suppose T1 is stably infinite w.r.t. Ssi, and T2 is strongly finitely witnessable w.r.t. Snsi with witness wit and smooth w.r.t. S. Then, the following are equivalent:

1. Γ1 ∪ Γ2 is (T1 ⊕ T2)-satisfiable;
2. there exists an arrangement δV over V such that Γ1 ∪ δV is T1-satisfiable and wit(Γ2) ∪ δV is T2-satisfiable, where V = ⋃_{σ∈S} Vσ, with Vσ = varsσ(wit(Γ2)) for σ ∈ Snsi and Vσ = varsσ(Γ1) ∩ varsσ(wit(Γ2)) for σ ∈ Ssi.

Finally, the following result, which is closest to Theorem 3, is directly obtained from Corollary 1, since the strong politeness of T2 w.r.t. Ssi ∪ Snsi implies that T2 is strongly finitely witnessable w.r.t. Snsi and smooth w.r.t. Ssi ∪ Snsi.

**Corollary 2.** Let Ssi ⊆ S and Snsi = S \ Ssi. If T1 is stably infinite w.r.t. Ssi and T2 is strongly polite w.r.t. S with a witness wit, then the following are equivalent:

1. Γ1 ∪ Γ2 is (T1 ⊕ T2)-satisfiable;
2. there exists an arrangement δV over V such that Γ1 ∪ δV is T1-satisfiable and wit(Γ2) ∪ δV is T2-satisfiable, where V = ⋃_{σ∈S} Vσ, with Vσ = varsσ(wit(Γ2)) for each σ ∈ Snsi and Vσ = varsσ(Γ1) ∩ varsσ(wit(Γ2)) for each σ ∈ Ssi.

Compared to Theorem 3, Corollary 2 partitions S into Ssi and Snsi and requires that T1 be stably infinite w.r.t. Ssi. The gain from this requirement is that the set Vσ is potentially reduced for σ ∈ Ssi. Note that, unlike Theorem 4 and Corollary 1, Corollary 2 places the same assumptions on T2 as the original Theorem 3 from [12]. We show its potential impact in the next example.

Example 7. Consider the theory TListIntBV4 from Example 3. Let Γ1 be x = 5 ∧ v = 0000 ∧ w = w & v, and let Γ2 be a0 = cons(x, v, a1) ∧ ⋀ⁿᵢ₌₁ ai = cons(yi, w, ai+1). Using the witness function wit from [18], wit(Γ2) = Γ2. The polite combination approach reduces the TListIntBV4-satisfiability of Γ1 ∧ Γ2 to the existence of an arrangement δ over {x, v, w} ∪ {y1, ..., yn} such that Γ1 ∧ δ is TIntBV4-satisfiable and wit(Γ2) ∧ δ is TList-satisfiable. Corollary 2 shows that we can do better. Since TIntBV4 is stably infinite w.r.t. {int}, it is enough to check the existence of an arrangement over the variables of sort BV4 that occur in wit(Γ2), together with the variables of sort int that are shared between Γ1 and Γ2. This means that arrangements over {x, v, w} are considered, instead of arrangements over {x, v, w} ∪ {y1, ..., yn}. As n becomes large, standard polite combination requires considering exponentially more arrangements, while the number of arrangements considered by our combination method remains the same.
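To make the blow-up concrete: an arrangement of k variables of a single sort is exactly a partition of those variables into equivalence classes, so the number of candidate arrangements for such a block is the Bell number B_k, which grows super-exponentially in k. A minimal sketch of the count (the function name `bell` is ours, purely for illustration; it is not part of the combination method):

```python
def bell(n):
    """Number of set partitions of an n-element set, via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]  # each new row starts with the previous row's last entry
        for entry in row:
            new_row.append(new_row[-1] + entry)
        row = new_row
    return row[0]

# A block of 3 same-sort variables admits bell(3) = 5 arrangements, while a
# block of 3 + n variables admits bell(3 + n): already 27644437 for n = 10.
print(bell(3), bell(13))
```

Ignoring the per-sort split of an arrangement (each sort partitions independently), this is why keeping {y1, ..., yn} out of the shared set keeps the search space constant in n.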

#### **4.2 Proof of Theorem 4**

The left-to-right direction is straightforward, using the reducts to Σ1 and Σ2 of the interpretation satisfying Γ1 ∪ Γ2. We now focus on the right-to-left direction, and begin with the following lemma, which strengthens Theorem 1 into a many-sorted Löwenheim-Skolem theorem in which the cardinalities of the finite sorts remain the same.

**Lemma 8.** Let Σ be a signature, T a Σ-theory, φ a Σ-formula, and A a T-interpretation that satisfies φ. Let S^Σ = S_A^fin ⊎ S_A^inf, where σ^A is finite for every σ ∈ S_A^fin and σ^A is infinite for every σ ∈ S_A^inf. Then there exists a T-interpretation B that satisfies φ such that |σ^B| = |σ^A| for every σ ∈ S_A^fin and σ^B is countable for every σ ∈ S_A^inf.

The proof of Theorem 4 continues with the following main lemma.

**Lemma 9 (Main Lemma).** Let Ssi ⊆ S and Snsi = S \ Ssi. Suppose T1 is stably infinite w.r.t. Ssi and that one of the three cases of Theorem 4 holds. Further, assume there exists an arrangement δV over V such that Γ1 ∪ δV is T1-satisfiable and wit(Γ2) ∪ δV is T2-satisfiable, where V = ⋃_{σ∈S} Vσ, with Vσ = varsσ(wit(Γ2)) for each σ ∈ Snsi and Vσ = varsσ(Γ1) ∩ varsσ(wit(Γ2)) for each σ ∈ Ssi. Then, there is a T1-interpretation A that satisfies Γ1 ∪ δV and a T2-interpretation B that satisfies wit(Γ2) ∪ δV such that |σ^A| = |σ^B| for all σ ∈ S.

Proof: Let ψ2 := wit(Γ2). Since T1 is stably infinite w.r.t. Ssi, there is a T1-interpretation A satisfying Γ1 ∪ δV in which σ^A is infinite for each σ ∈ Ssi. By Theorem 1, we may assume that σ^A is countable for each σ ∈ Ssi. We consider the first case of Theorem 4 (the others are omitted due to space constraints). Suppose T2 is strongly stably infinite w.r.t. (Ssi, Snsi) and strongly polite w.r.t. Snsi. Since T2 is strongly finitely witnessable w.r.t. Snsi, there exists a T2-interpretation B that satisfies ψ2 ∪ δV such that σ^B = (Vσ)^B for each σ ∈ Snsi. Since A and B satisfy δV, we have, for every σ ∈ Snsi, |σ^B| = |(Vσ)^B| = |(Vσ)^A| ≤ |σ^A|. T2 is also smooth w.r.t. Snsi, and so there exists a T2-interpretation B′ satisfying ψ2 ∪ δV such that |σ^B′| = |σ^A| for each σ ∈ Snsi. Finally, T2 is strongly stably infinite w.r.t. (Ssi, Snsi), so there is a T2-interpretation B″ that satisfies ψ2 ∪ δV such that σ^B″ is infinite for each σ ∈ Ssi and |σ^B″| = |σ^B′| = |σ^A| for each σ ∈ Snsi. By Lemma 8, we may assume that σ^B″ is countable for each σ ∈ Ssi. Thus, |σ^B″| = |σ^A| for each σ ∈ S. □

We now conclude Theorem 4. Let T := T1 ⊕ T2. Lemma 9 gives us a T1-interpretation A with A ⊨ Γ1 ∪ δV and a T2-interpretation B with B ⊨ ψ2 ∪ δV, and |σ^A| = |σ^B| for σ ∈ S. Set Γ′1 := Γ1 ∪ δV and Γ′2 := ψ2 ∪ δV. Then, Vσ = varsσ(Γ′1) ∩ varsσ(Γ′2) for σ ∈ S. Now, A ⊨ Γ′1 ∪ δV and B ⊨ Γ′2 ∪ δV, and |σ^A| = |σ^B| for σ ∈ S. By Theorem 2, Γ′1 ∪ Γ′2 is T-satisfiable. In particular, Γ1 ∪ {ψ2} is T-satisfiable, and hence so is Γ1 ∪ {∃w⃗. ψ2}, with w⃗ = vars(wit(Γ2)) \ vars(Γ2). Finally, ∃w⃗. wit(Γ2) is T2-equivalent to Γ2, hence Γ1 ∪ Γ2 is T-satisfiable. □

## **5 Preliminary Case Study**

The results presented in Section 4 were motivated by a set of smart contract verification benchmarks. We obtained these benchmarks by applying the open-source Move Prover verifier [22] to smart contracts found in the open-source Diem project [9]. The Move Prover is a formal verifier for smart contracts written in the Move language [6] and was designed to target smart contracts used in the Diem blockchain [1]. It works via a translation to the Boogie verification framework [14], which in turn produces SMT-LIB 2 benchmarks that are dispatched to SMT solvers. The benchmarks we obtained involve datatypes, integers, Booleans, and quantifiers. Our case study began by running CVC4 [3] on the benchmarks. For most of the benchmarks solved by CVC4, theory combination took a small percentage of the overall runtime of the solver, accounting for 10% or less in all but one benchmark. However, solving that benchmark took 81 seconds, of which 20 seconds were dedicated to theory combination.

We implemented an optimization of the datatype solver of CVC4 based on Corollary 2. With the original polite combination method, every term originating from the theory of datatypes whose sort belongs to another theory is shared with the other theories, triggering an analysis of the arrangements of these terms. In our optimization, we limit the sharing of such terms to those of Boolean sort. In the language of Corollary 2, T1 is the combined theory of Booleans, uninterpreted functions, and integers, which is stably infinite w.r.t. the uninterpreted and integer sorts. T2 is an instance of the theory of datatypes, which is strongly polite w.r.t. its element sorts, which in this case are the sorts of T1.

A comparison of the original and optimized runs on the difficult benchmark is shown in Figure 3. As shown, the optimization reduces the total running time by 75%, and the time spent on theory combination in particular by 83%. To further isolate the effectiveness of our optimization, we report the number of terms that each theory solver considered. In CVC4, constraints are not flattened, so shared terms are processed instead of shared variables. Each theory solver


**Fig. 3.** Runtimes (in seconds) and number of terms (in thousands) added to the data structures of DT, INT, UFB, and the number of shared terms (shared).

maintains its own data structure for tracking equality information. These data structures contain terms belonging to the theory that either come from the input assertions or are shared with another theory. A data structure containing all shared terms belonging to any theory is also maintained. The last four columns of Figure 3 count the number of times (in thousands) a term was added to the equality data structure for the theory of datatypes (DT), integers (INT), and uninterpreted functions and Booleans (UFB), as well as to the shared-term data structure (shared). With the optimization, the datatype solver keeps more inferred assertions internally, which leads to an increase in the number of term additions to its data structure. However, sharing fewer terms reduces the number of terms in the data structures of the other theories. Moreover, while the total number of terms considered remains roughly the same, the number of shared terms decreases by 24%. This suggests that although the workload on the individual theory solvers is roughly similar, the decrease in the number of shared terms in the optimized run results in a significant improvement in the overall runtime. Although our evidence is only anecdotal at the moment, we believe this benchmark is highly representative of the potential benefits of our optimization.

#### **6 Conclusion**

This paper makes two contributions. First, we separated politeness from strong politeness, which shows that the (typically harder) task of finding a strong witness is sometimes not a waste of effort. Second, we provided an optimization of the polite combination method, which applies when one of the theories in the combination is stably infinite w.r.t. a subset of the sorts.

We envision several directions for future work. First, the separation of politeness from strong politeness demonstrates a need to identify sufficient criteria for the equivalence of these notions, such as the additivity criterion introduced by Sheng et al. [18]. Second, polite combination might be optimized by applying the witness function only to part of the purified input formula. Finally, we plan to extend the initial implementation of this approach in CVC4 and evaluate its impact on more benchmarks.

# **References**



# **Equational Theorem Proving Modulo**

Dohan Kim(B) and Christopher Lynch

Clarkson University, Potsdam, NY, USA {dohkim,clynch}@clarkson.edu

**Abstract.** Unlike other methods for theorem proving modulo with constrained clauses [12, 13], equational theorem proving modulo with constrained clauses, along with its simplification techniques, has not been well studied. We introduce a basic paramodulation calculus modulo equational theories E satisfying certain properties, and present a new framework for equational theorem proving modulo E with constrained clauses. We propose an inference rule called Generalized E-Parallel for constrained clauses, which makes our inference system completely basic, meaning that we need not allow any paramodulation in the constraint part of a constrained clause for refutational completeness. We present a saturation procedure for constrained clauses based on relative reducibility and show that our inference system, including our contraction rules, is refutationally complete.

# **1 Introduction**

Equations occur frequently in many areas of mathematics, logic, and computer science. Equational theorem proving [6, 8, 19, 22] is, in general, concerned with proving mathematical or logical statements in first-order clause logic with equality. While resolution [24] has been successful for theorem proving in first-order clause logic without equality, it has limitations in dealing with the equality predicate. For example, when handling the equality predicate with resolution, one must explicitly add the congruence axioms for each predicate and function symbol in order to express the properties of equality [8, 22].
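For a binary function symbol f, for instance, the required congruence axiom is x1 ≈ y1 ∧ x2 ≈ y2 → f(x1, x2) ≈ f(y1, y2), and one such axiom is needed per symbol. A small sketch that generates these axioms as strings (the textual encoding is our own, purely for illustration):

```python
def congruence_axiom(symbol, arity):
    """Build the congruence axiom for a function symbol as an implication string."""
    xs = [f"x{i}" for i in range(1, arity + 1)]
    ys = [f"y{i}" for i in range(1, arity + 1)]
    antecedent = " & ".join(f"{x} = {y}" for x, y in zip(xs, ys))
    return f"{antecedent} -> {symbol}({', '.join(xs)}) = {symbol}({', '.join(ys)})"

print(congruence_axiom("f", 2))
# x1 = y1 & x2 = y2 -> f(x1, x2) = f(y1, y2)
```

Paramodulation builds this replacement of equals by equals into the calculus itself, so no such axioms need to be generated.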

Paramodulation [23] is based on the replacement of equals by equals and aims to improve the efficiency of resolution in equational theorem proving. However, paramodulation often produces a large number of unnecessary clauses, so the search space for a refutation grows very rapidly. Therefore, various improvements of paramodulation have been developed. For example, it was shown that the functional reflexivity equations used by the traditional paramodulation rule [23] are not needed, and that paramodulation into variables need not be allowed (see [8]).

Basic paramodulation [9, 20] restricts paramodulation by forbidding paramodulation at (sub)terms introduced by substitutions in previous inference steps, and uses orderings on terms and literals to further restrict paramodulation inferences. In [21, 26], basic paramodulation was extended to basic paramodulation modulo the associativity and commutativity (AC) axioms.

(See [25] also for basic paramodulation modulo the associativity (A) axiom.) Basic paramodulation modulo AC uses symbolic constraints, overcoming a drawback of traditional paramodulation modulo AC (see [7,27]), which often generates many slightly different permuted variants of clauses. For example, more than a million conclusions can be generated by paramodulating the equation x + x + x = x into the clause P(y1 + y2 + y3 + y4), where + is an AC symbol, since a minimal complete set of AC-unifiers for x + x + x and y1 + y2 + y3 + y4 contains more than a million AC-unifiers [21, 26]. On the other hand, one only needs the single conclusion P(x) || x + x + x ≈?_AC y1 + y2 + y3 + y4 for the above inference using basic paramodulation modulo AC with an equality constraint.

In this paper, we present a new basic paramodulation calculus modulo equational theories E (including E = AC) parameterized by a suitable E-compatible ordering ≻. Our main inference rule for basic paramodulation modulo E is given (roughly) as follows:

$$\frac{C \lor s \approx t \parallel \phi_1 \qquad D \lor L[s'] \parallel \phi_2}{C \lor D \lor L[t] \parallel s \approx_E^? s' \land \phi_1 \land \phi_2}$$

The equality constraints are inherited, and the accumulated E-unification problems are kept in the constraint part of the conclusion. Instead of generating as many conclusions as there are unifiers in a minimal complete set of E-unifiers of two terms s and s′, a single conclusion is generated whose constraint records the E-unification problem of s and s′. Another key inference rule in our basic paramodulation calculus modulo E is the Generalized E-Parallel (or E-Parallel) rule, adapted from our recent work on basic narrowing modulo [18]. This rule allows our basic paramodulation calculus to pass from the free case (i.e., E = ∅) to the modulo E case (i.e., E ≠ ∅).<sup>1</sup> For example, suppose that we have three clauses 1: a + b ≈ c, 2: a + (b + x) ≈ c + x, and 3: (a + a) + (b + b) ≉ c + c, where + is an AC symbol with + ≻ a ≻ b ≻ c. We use the E-Parallel rule on clauses 1 and 2 and obtain the clause 4: a + (b + (a + b)) ≈ c + c, which yields a contradiction with clause 3 because a + (b + (a + b)) ≈AC (a + a) + (b + b) (i.e., the equality constraint is satisfiable). The details of this inference rule are discussed in Section 4.
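The satisfiability check for the constraint in this example rests on a standard fact: two ground terms built from a single AC symbol are AC-equivalent iff they flatten to the same multiset of arguments. A brute-force sketch of this check (the nested-tuple term representation is our own assumption, not notation from the paper):

```python
from collections import Counter

def flatten(term, op="+"):
    """Return the multiset of leaves of a ground term under an AC operator.
    Terms are nested tuples (op, left, right) or atomic constants."""
    if isinstance(term, tuple) and term[0] == op:
        return flatten(term[1], op) + flatten(term[2], op)
    return Counter([term])

def ac_equal(s, t, op="+"):
    """AC-equivalence of ground terms over a single AC symbol."""
    return flatten(s, op) == flatten(t, op)

# a + (b + (a + b))  vs.  (a + a) + (b + b): both flatten to {a, a, b, b}.
lhs = ("+", "a", ("+", "b", ("+", "a", "b")))
rhs = ("+", ("+", "a", "a"), ("+", "b", "b"))
print(ac_equal(lhs, rhs))  # True
```

This only covers the ground, single-symbol case; general AC-unification of terms with variables is the much harder problem that the equality constraints postpone.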

Throughout this paper, we assume that (i) we are given an E-compatible reduction ordering ≻ on terms with the subterm property that is E-total on ground terms, (ii) E has a finitary and complete unification algorithm, and (iii) E-congruence classes are finite. (If E satisfies condition (i), then E is necessarily regular [2].) Under these assumptions on E, we can deal uniformly with different equational theories E in our framework and show that our inference system, including our contraction rules, is refutationally complete.

The known practical theories satisfying the above assumptions of E are AC and finite permutation theories [1, 17]. (For example, if one considers an ACI symbol + using our approach, then AC should be a modulo E part and the idempotency axiom (I : x + x ≈ x) should be a part of the input formulas.) Although associative (A)-unification is infinitary, our approach is also applicable

<sup>1</sup> If E = ∅, then we may disregard the Generalized E-Parallel (or E-Parallel) rule along with the E-Completion rule and replace E-unification with syntactic unification.

to the case where E = A in practice, since there is a tool for A-unification which is guaranteed to terminate with a finite and complete set of A-unifiers for a significantly large class of A-unification problems (see [14]).

The longer version of this paper is found in [16].

## **2 Preliminaries**

We assume that the reader has some familiarity with rewrite systems [3] (including the extended rewrite system for R modulo E (i.e. R, E) [11, 15]) and unification [4]. We use the standard terminology of paramodulation [6, 9, 22].

We denote by T(F, X ) the set of terms over a finite set of function symbols F and a denumerable set of variables X . An equation is an expression s ≈ t, where s and t are (first-order) terms built from T(F, X ). A literal is either an equation L (a positive literal) or a negative equation ¬L (a negative literal). A clause is a finite multiset of literals, written as a disjunction of literals ¬A<sup>1</sup> ∨···∨¬A<sup>m</sup> ∨ B<sup>1</sup> ∨···∨ B<sup>n</sup> or as an implication Γ → Δ, where the multiset Γ is called the antecedent and the multiset Δ is called the succedent of the clause. (Recall that a multiset is an unordered collection with possible duplicate elements.)

An equational theory is a set of equations. (In this paper, an equational theory and a set of axioms are used interchangeably.) We denote by ≈<sup>E</sup> the least congruence on T(F, X ) that is closed under substitutions and contains a set of equations E. If s ≈<sup>E</sup> t for two terms s and t, then s and t are E-equivalent.

A (strict) ordering ≻ on terms is monotonic if s ≻ t implies u[s]p ≻ u[t]p for all s, t, u and positions p. An ordering ≻ on terms is stable under substitutions if s ≻ t implies sσ ≻ tσ for all s, t, and substitutions σ. An ordering on terms is a rewrite ordering if it is monotonic and stable under substitutions. A well-founded rewrite ordering is a reduction ordering. An ordering ≻ on terms has the subterm property if t[s]p ≻ s for all s, t, and p ≠ λ. (In this paper, λ denotes the top position.) A simplification ordering is a rewrite ordering with the subterm property. An ordering ≻ on terms is E-compatible if s ≻ t, s ≈E s′, and t ≈E t′ imply s′ ≻ t′ for all s, s′, t, and t′. An ordering ≻ on ground terms is E-total if s ≉E t implies s ≻ t or t ≻ s for all ground terms s and t.

Given a multiset S and an E-compatible ordering ≻ on S, we say that x is maximal (resp. strictly maximal) in S if there is no y ∈ S (resp. y ∈ S \ {x}) with y ≻ x (resp. y ⪰ x).

Clauses may also be considered as multisets of occurrences of equations. An occurrence of an equation s ≈ t in the antecedent of a clause is the multiset {{s, t}}, and in the succedent it is the multiset {{s}, {t}}. We ambiguously denote all these orderings on terms, equations, and clauses by ≻.

An equational theory is permutative if each equation in the theory contains the same symbols on both sides with the same number of occurrences. The depth of a term t is defined by depth(t) = 0 if t is a variable or a constant, and depth(f(s1, ..., sn)) = 1 + max{depth(si) | 1 ≤ i ≤ n}. We say that an equational theory has maximum depth at most k if the maximum depth of all terms in the equations of the theory is less than or equal to k.
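The depth function transcribes directly into code; the following sketch uses a nested-tuple representation of terms (our own convention, with non-tuples standing for variables and constants):

```python
def depth(term):
    """depth(t) = 0 for a variable or constant; otherwise 1 + max over the arguments."""
    if not isinstance(term, tuple):
        return 0
    _, *args = term
    return 1 + max(depth(a) for a in args)

# f(a, g(b)) has depth 2; likewise the AC axiom x + (y + z) ≈ (x + y) + z
# only contains terms of depth at most 2.
print(depth(("f", "a", ("g", "b"))))  # 2
```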

A (Herbrand) interpretation I is a congruence on ground terms. I satisfies (is a model of) a ground clause Γ → Δ, denoted by I ⊨ Γ → Δ, if I ⊉ Γ or I ∩ Δ ≠ ∅. In this case, we say that Γ → Δ is true in I. A ground clause C follows from a set of ground clauses {C1, ..., Ck}, written {C1, ..., Ck} ⊨ C, if C is true in every model of {C1, ..., Ck}.

#### **3 Constrained Clauses**

**Definition 1** (Constrained clauses) [22, 26] A constrained clause is a pair C || φ, where C is a clause and φ is an equality constraint consisting of a conjunction of equations of the form s ≈?_E t for terms s and t. The set of solutions of a constraint φ, denoted by Sol(φ), is the set of ground substitutions defined inductively as follows:

$$\begin{array}{c} Sol(\phi_1 \wedge \phi_2) = Sol(\phi_1) \cap Sol(\phi_2), \\ Sol(s \approx_E^? t) = \{ \sigma \mid s\sigma \text{ and } t\sigma \text{ are } E\text{-equivalent} \}. \end{array}$$

A constraint φ is satisfiable if it admits at least one solution.
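Over a fixed finite set of ground terms, Sol(φ) can be enumerated directly, which makes the two defining equations executable; in particular, Sol(φ1 ∧ φ2) = Sol(φ1) ∩ Sol(φ2) holds by construction. A brute-force sketch for the syntactic case E = ∅, restricted to flat terms that are either variables or constants (representation and function names are ours):

```python
from itertools import product

def solutions(constraint, variables, ground_terms):
    """All ground substitutions (as dicts) over ground_terms solving a
    conjunction given as a list of pairs (s, t).  With E = ∅, s ≈? t is
    solved when the instantiated terms coincide syntactically."""
    def instantiate(term, sub):
        return sub.get(term, term)  # flat terms only: variables or constants
    sols = []
    for values in product(ground_terms, repeat=len(variables)):
        sub = dict(zip(variables, values))
        if all(instantiate(s, sub) == instantiate(t, sub) for s, t in constraint):
            sols.append(sub)
    return sols

# Sol(x ≈? a ∧ y ≈? y) over {a, b}: x is forced to a, y is unconstrained.
print(solutions([("x", "a"), ("y", "y")], ["x", "y"], ["a", "b"]))
```

Replacing syntactic comparison by an E-equivalence test yields the general definition, at the cost of deciding E-equivalence.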

A constrained clause with an unsatisfiable constraint is a tautology. If every ground substitution with domain Vars(φ) of C || φ is a solution of φ, then φ is a tautological constraint. An unconstrained clause can also be considered a constrained clause with a tautological constraint.

The main technical difficulties in lifting a reduced ground inference to an inference at the clause level in a basic paramodulation inference system involve a ground clause of the form Cσ := Dσ ∨ xσ ≈ tσ with C := D ∨ x ≈ t || φ and σ ∈ Sol(φ), where xσ ⇒ tσ ∈ R for a given ground rewrite system R. This motivates the following definition of irreducibility, which lets us lift a reduced ground inference to an inference at the clause level in our inference system. (See [9] also for order-irreducibility in the free case.)

**Definition 2** (Order-irreducibility) Given a ground rewrite system R and an equational theory E, a ground literal L[l′]p is order-reducible (at position p) by R, E with l ⇒ r ∈ R if l′ ≈E l, l ≻ r, and L ≻ l ≈ r. A literal L[s] is order-irreducible in s by R, E if L[s] is not order-reducible at any position of s.

In Definition 2, the condition L ≻ l ≈ r is always true when L is a negative literal, or else when l′ does not occur at the top (i.e., p ≠ λ) of the largest term of L.
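In the ground syntactic case (E = ∅, with the ordering side conditions dropped, since l ≻ r holds for any oriented rule), order-reducibility degenerates to ordinary reducibility: some subterm of the literal matches a left-hand side of R. A sketch of that degenerate check (nested-tuple terms, our own representation):

```python
def subterms(term):
    """Yield every subterm of a ground term in nested-tuple representation."""
    yield term
    if isinstance(term, tuple):
        for arg in term[1:]:
            yield from subterms(arg)

def reducible(term, rules):
    """E = ∅ sketch: a ground term is reducible by R iff some subterm
    is syntactically equal to a left-hand side of R."""
    lhss = {l for l, _ in rules}
    return any(s in lhss for s in subterms(term))

R = [(("f", "a"), "a")]                  # the single ground rule f(a) => a
print(reducible(("g", ("f", "a")), R))   # True: f(a) occurs as a subterm
print(reducible(("g", "b"), R))          # False: irreducible by R
```

The full definition additionally matches modulo E (l′ ≈E l) and keeps the literal-level condition L ≻ l ≈ r, which this sketch omits.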

**Definition 3** (Reduced ground instances) Given a ground rewrite system R and an equational theory E, Cσ is a ground instance of C || φ if σ is a solution of φ (i.e., σ ∈ Sol(φ)). It is a reduced ground instance of C || φ w.r.t. R, E if σ is a solution of φ and each ground literal L[xσ] in Cσ is order-irreducible in xσ by R, E for each variable x ∈ Vars(C). In this case, σ is a reduced solution of C || φ w.r.t. R, E.

**Definition 4** (A model of a constrained clause) An interpretation I satisfies (is a model of) a constrained clause C || φ, denoted by I |= C || φ, if it satisfies every ground instance of C || φ (i.e. every Cσ for which σ is a solution of φ).

**Definition 5** (Reductiveness, weak reductiveness, semi-reductiveness, and weak maximality) An equation s ≈ t is reductive (resp. weakly reductive) for C || φ := D ∨ s ≈ t || φ if there exists a ground instance Cσ such that sσ ≈ tσ is strictly maximal (resp. maximal) in Cσ with sσ ≻ tσ. The clause C || φ is simply called reductive if there exists a reductive equation s ≈ t for C || φ. A negative equation u ≉ v is semi-reductive (resp. weakly reductive) for C || φ := D ∨ u ≉ v || φ if there exists a ground instance Cσ such that uσ ≻ vσ (resp. uσ ≻ vσ and uσ ≉ vσ is maximal in Cσ). A literal L is weakly maximal for C || φ := D ∨ L || φ if there exists a ground instance Cσ such that Lσ is maximal in Cσ.

## **4 Inference Rules**

The inference rules in our inference system are parameterized by a selection function S and an E-compatible reduction ordering ≻ with the subterm property that is E-total on ground terms, where S selects at most one (occurrence of a) negative literal in the clause part C of each (constrained) clause C || φ. For technical convenience, if a literal L is selected in C, then we also say that L is selected in C || φ. In our inference rules, a literal in a clause C || φ is involved in some inference if it is selected in C (by S), or nothing is selected and it is maximal in C (cf. [8]). The following Basic Paramodulation rule is our main inference rule for equational theorem proving modulo E; only the maximal sides of literals in clauses are involved in inferences by this rule. We rename variables in the premises of our inference rules if necessary so that no variable is shared between premises (i.e., the premises are standardized apart).

#### **Basic Paramodulation**

$$\frac{C \lor s \approx t \parallel \phi_1 \qquad D \lor L[s'] \parallel \phi_2}{C \lor D \lor L[t] \parallel s \approx_E^? s' \land \phi_1 \land \phi_2} \quad \text{if:}$$

	- (a) L is selected in the right premise, and L is of the form u[s′] ≉ v and is semi-reductive for the right premise.
	- (b) nothing is selected in the right premise, and L is of the form u[s′] ≈ v and is reductive for the right premise.
	- (c) nothing is selected in the right premise, and L is of the form u[s′] ≈ v and is weakly reductive for the right premise.

#### **Equality Resolution**

$$\frac{C \lor s \not\approx t \parallel \phi}{C \parallel s \approx\_E^? t \land \phi}\qquad\text{if}$$

s ≉ t is selected, or else nothing is selected and s ≉ t is weakly maximal for the premise.

#### **E**-**Factoring**

$$\frac{C \lor s \approx t \lor s' \approx t' \parallel \phi}{C \lor t \not\approx t' \lor s' \approx t' \parallel s \approx\_E^?s' \land \phi} \quad \text{if}$$

s ≈ t is weakly reductive for the premise, and C contains no selected literal.

#### **E**-**Completion**

$$\frac{C \lor s \approx t \parallel \phi}{C \lor e\_1[t]\_p \approx e\_2 \parallel s \approx\_E^? s' \land \phi} \quad \text{if} \quad$$

1. e₁[s′]\_p ≈ e₂ ∈ E and p ≠ λ, where s′ is not a variable,

2. s ≈ t is reductive for the premise, and C contains no selected literal.

The above E-Completion rule is an adaptation of the E-closure [27] rule using equality constraints (cf. E-extension [5]).

#### **E**-**Parallel**

$$\frac{C \lor s \approx t \parallel \phi\_1 \qquad D \lor l \approx r \parallel \phi\_2}{C \lor D \sigma \lor l \sigma \approx r \theta \parallel \phi\_1 \land \phi\_2} \quad \text{if}$$


#### **Generalized E**-**Parallel**

$$\frac{C \lor s \approx t \parallel \phi\_1 \qquad D \lor l \approx r \parallel \phi\_2}{C \lor D \sigma \lor l \sigma \approx r \theta \parallel \phi\_1 \land \phi\_2} \quad \text{if}$$


6. there is a term u with u ≈\_E lσ, such that u is R, E-reducible with R = {l ⇒ r, s ⇒ t} only at the top position.

We mark each clause produced by the Generalized E-Parallel (or E-Parallel) rule as "protected" so that it is protected from our contraction rules discussed in Section 5. (We simply say each marked clause is a protected clause.) Protected clauses behave the same way as other clauses in our inference rules, but our contraction rules are not applied to protected clauses (see Section 5 for details).

We may also use predicate terms [6] P(t₁,...,tₙ) in our inference system, where a predicate term cannot be a proper subterm of any term. Note that a predicate term P(t₁,...,tₙ) can be expressed as an equation P(t₁,...,tₙ) ≈ tt, where tt is a special constant symbol minimal in the ordering and P is considered as a function symbol. (In this sense, ¬P(t₁,...,tₙ) can be expressed as P(t₁,...,tₙ) ≉ tt.) In the remainder of this paper, by BP we denote the inference system consisting of the Basic Paramodulation, Equality Resolution, E-Factoring, E-Completion, and Generalized E-Parallel rules. If E is a permutative theory with maximum depth at most 2 (e.g. E = A, C, or AC), then we use the simpler E-Parallel rule instead of the Generalized E-Parallel rule in BP (see Lemma 6).
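This encoding of predicate atoms as equations can be sketched concretely. The following Python fragment is an illustrative sketch (the term representation and function name are ours); it writes an atom P(t₁,...,tₙ) as an equation against a special minimal constant, written `tt` here after the `tt ≈ tt` clause used later in the model construction:

```python
TT = "tt"  # special constant symbol, assumed minimal in the ordering

def atom_to_equation(pred, args, positive=True):
    """Encode P(t1,...,tn) as the (dis)equation P(t1,...,tn) ≈ tt / ≉ tt.

    A term is a constant string or a tuple (symbol, arg1, ..., argn);
    the result is (lhs, rhs, polarity)."""
    return ((pred, *args), TT, positive)

# P(a, b) becomes the equation P(a, b) ≈ tt:
print(atom_to_equation("P", ("a", "b")))
# ¬P(a) becomes the disequation P(a) ≉ tt:
print(atom_to_equation("P", ("a",), positive=False))
```

With this encoding, equality remains the only predicate, which is the assumption made in Section 6.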

Example 1. Let + be an AC symbol (in infix notation) with + ≻ a ≻ b ≻ 0 and consider the following inconsistent set of clauses 1: x + 0 ≈ x, 2: a + a ≈ 0, 3: b + b ≈ 0, and 4: (a + b) + (a + b) ≈ 0. Now we show how the empty clause (with a satisfiable constraint) is derived:

5: (x + y) + z ≈ x + 0 || y + z ≈?\_AC a + a (E-Completion with 2 using the associativity axiom x + (y + z) ≈ (x + y) + z.)

6: ((b + b) + y) + z ≈ 0 + 0 || y + z ≈?\_AC a + a (E-Parallel with 3 into 5. In condition 5 of the E-Parallel rule, the term u corresponds to (b + y) + (b + z) here.)

7: 0 + 0 ≈ 0 || ((b + b) + y) + z ≈?\_AC (a + b) + (a + b) ∧ y + z ≈?\_AC a + a (Basic Paramodulation with 6 into 4)

8: x ≈ 0 || x + 0 ≈?\_AC 0 + 0 ∧ ((b + b) + y) + z ≈?\_AC (a + b) + (a + b) ∧ y + z ≈?\_AC a + a (Basic Paramodulation with 1 into 7)

9: □ || x ≈?\_AC 0 ∧ x + 0 ≈?\_AC 0 + 0 ∧ ((b + b) + y) + z ≈?\_AC (a + b) + (a + b) ∧ y + z ≈?\_AC a + a (Equality Resolution on 8)

In contrast, the existing approaches for basic paramodulation modulo AC [21, 26] use clauses 2 and 4, for example, and produce clause 5′: 0 + x ≈ 0 || x ≈?\_AC b + b and then clause 6′: 0 + y ≈ 0 || x ≈?\_AC b + b ∧ y ≈?\_AC 0 by their inference rules. Then 6′ is used to derive a contradiction with 1. One may view 6′ as being obtained from 5′ by an indirect paramodulation with 3 in the constraint part. In our approach, we simply block clauses like 5′ from further inferences (see Definition 12), and no direct or indirect paramodulation is allowed in the constraint part of any clause.
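The AC-equality constraints used in Example 1 can be checked mechanically for ground instances. Below is a minimal, illustrative Python sketch (the tuple-based term representation and function names are ours, not from the paper): ground terms built from constants and a single AC symbol are equal modulo AC exactly when flattening all nested applications of that symbol yields the same argument multiset.

```python
def flatten(t, op="+"):
    """Collect the multiset of arguments of nested op-applications.

    A term is a constant string or a tuple (symbol, arg1, ..., argn)."""
    if isinstance(t, tuple) and t[0] == op:
        args = []
        for a in t[1:]:
            args.extend(flatten(a, op))
        return args
    return [t]

def ac_equal(s, t, op="+"):
    """Ground s and t are equal modulo AC of op iff their flattened
    argument multisets coincide: associativity and commutativity let us
    ignore nesting and argument order."""
    return sorted(map(repr, flatten(s, op))) == sorted(map(repr, flatten(t, op)))

# The constraint of clause 7 in Example 1 is satisfied by {y ↦ a, z ↦ a}:
lhs = ("+", ("+", ("+", "b", "b"), "a"), "a")   # ((b + b) + a) + a
rhs = ("+", ("+", "a", "b"), ("+", "a", "b"))   # (a + b) + (a + b)
print(ac_equal(lhs, rhs))                        # True
print(ac_equal(("+", "a", "a"), ("+", "a", "a")))  # True: y + z ≈?_AC a + a
```

This only decides ground AC-equality; the constraints in the example contain variables, so a solution such as {y ↦ a, z ↦ a} must be supplied (or found by AC-unification) before the check applies.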

Example 2. Consider S = {f(g(x)) ≈ x, a ≈ b, c ≉ g(b)} and E = {f(g(g(a))) ≈ c} with f ≻ g ≻ a ≻ b, where E is a regular theory with maximum depth 3. The Generalized E-Parallel rule with premises f(g(x)) ≈ x and a ≈ b produces the conclusion f(g(g(a))) ≈ g(b). (Choose l as f(g(x)), s as a, and u as g(a) in the Generalized E-Parallel rule.) Then it is used to derive a contradiction with clause c ≉ g(b), since f(g(g(a))) ≈\_E c.

In the above example, a suitable E-compatible reduction ordering on ground terms is obtained as follows: given two ground terms, we rewrite each occurrence of c in each term into f(g(g(a))) (at the position of that occurrence of c), and then compare the rewritten ground terms, which contain no occurrence of c, using the standard lexicographic path ordering [3, 22]. We may then compare terms with variables by considering their ground instances and using this ordering on ground terms.
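The ordering construction just described can be sketched in code. The following Python fragment is an illustrative implementation of the textbook lexicographic path ordering over ground terms, instantiated with the precedence f ≻ g ≻ a ≻ b from Example 2; the term representation and all names are ours, not from the paper.

```python
PREC = {"f": 3, "g": 2, "a": 1, "b": 0}  # precedence f > g > a > b (Example 2)

def head(t):
    return t[0] if isinstance(t, tuple) else t

def args(t):
    return t[1:] if isinstance(t, tuple) else ()

def lpo_gt(s, t):
    """s >_lpo t for ground terms ('f', arg1, ...) or constant strings."""
    if s == t:
        return False
    # (1) some argument of s is >= t (subterm case)
    if any(si == t or lpo_gt(si, t) for si in args(s)):
        return True
    f, g = head(s), head(t)
    # (2) head of s has greater precedence and s dominates every argument of t
    if PREC[f] > PREC[g]:
        return all(lpo_gt(s, ti) for ti in args(t))
    # (3) equal heads: first strictly decreasing argument pair decides,
    #     and s must still dominate every argument of t
    if f == g:
        for si, ti in zip(args(s), args(t)):
            if si == ti:
                continue
            return lpo_gt(si, ti) and all(lpo_gt(s, tj) for tj in args(t))
    return False

# After rewriting c into f(g(g(a))), comparisons in Example 2 become, e.g.:
print(lpo_gt(("f", ("g", ("g", "a"))), ("g", "b")))  # True: f(g(g(a))) ≻ g(b)
print(lpo_gt("a", "b"))                               # True: a ≻ b
```

Note that this plain LPO is not itself E-compatible; in the paper's construction it is applied only after the c-elimination step described above.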

In what follows, by the Parallel rule we mean the E-Parallel or the Generalized E-Parallel rule. First, observe that we cannot derive a contradiction in either Example 1 or Example 2 using the inference rules in BP without the Parallel rule. The intuition behind the Parallel rule is that, above all, a reductive ground clause corresponds to a reductive ground conditional rewrite rule [19] with positive and negative conditions. Therefore, roughly speaking, the premises of the Parallel rule are reductive conditional rewrite rules with positive and negative conditions. (The Parallel rule applies only to reductive clauses.) Now the conclusion of the Parallel rule combines two steps: (i) instantiating a "problematic" variable in a special and restricted way, and (ii) selectively rewriting an instantiated term if the conditions are met. (This is why the conditions C are included in the conclusion.) A problematic variable is often determined by a built-in equational theory E. It is mostly a variable produced by an E-Completion inference (see Example 1) in the AC case, which is the counterpart of an extension variable for AC-extension [7, 27].

Observe that the Generalized E-Parallel rule is more general than the E-Parallel rule. If p is always the top position for the Generalized E-Parallel rule, then they are equivalent. This is the case for permutative theories with maximum depth at most 2 (e.g. E = A, C, or AC).

**Lemma 6** If E is a permutative theory with maximum depth at most 2, then the E-Parallel rule and the Generalized E-Parallel rule are equivalent, i.e., they generate the same conclusion for the same input premises.

Note that the E-Completion and the Parallel rules are not always needed for every built-in equational theory E. The following example is a simple variant of the reachability problem [15] modulo a permutation theory [1, 17], where ¬P(f(c, b, b, d, e)) is the query from the initial configuration P(f(a, b, c, d, e)). We may view E in the following example as consisting of all permutations of the variables x₁, x₂, x₃, x₄, and x₅, since the symmetric group S₅ is generated by the two cycles (1 2) and (1 2 3 4 5).

Example 3. Let E = {f(x₁, x₂, x₃, x₄, x₅) ≈ f(x₂, x₁, x₃, x₄, x₅), f(x₁, x₂, x₃, x₄, x₅) ≈ f(x₂, x₃, x₄, x₅, x₁)} with P ≻ f ≻ a ≻ b ≻ c ≻ d ≻ e and consider the following set of clauses 1: ¬P(f(c, b, b, d, e)), 2: P(f(a, b, c, d, e)), and 3: f(a, b, x, y, z) ≈ f(b, b, x, y, z). Basic Paramodulation with 3 into 2 yields clause 4: P(f(b, b, x, y, z)) || f(a, b, x, y, z) ≈?\_E f(a, b, c, d, e). By applying Basic Paramodulation with 1 and 4 (using P(f(c, b, b, d, e)) ≉ tt and P(f(b, b, x, y, z)) ≈ tt || f(a, b, x, y, z) ≈?\_E f(a, b, c, d, e)) and then applying Equality Resolution, we obtain clause 5: □ || f(b, b, x, y, z) ≈?\_E f(c, b, b, d, e) ∧ f(a, b, x, y, z) ≈?\_E f(a, b, c, d, e). The equality constraint in 5 is satisfiable, and we have a contradiction. Note that clause 4 schematizes the set of ground clauses {P(f(b, b, c, d, e)), P(f(b, b, c, e, d)), P(f(b, b, d, c, e)), P(f(b, b, d, e, c)), P(f(b, b, e, c, d)), P(f(b, b, e, d, c))}.
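The group-theoretic remark above — that the two cycles (1 2) and (1 2 3 4 5) generate the full symmetric group S₅ — is easy to verify by a small closure computation. The following Python sketch (representation and names are ours) encodes permutations as tuples of images and closes the generators under composition:

```python
def compose(p, q):
    """Compose permutations given as tuples of images: (p ∘ q)(i) = p(q(i))."""
    return tuple(p[q[i]] for i in range(len(p)))

def generated_group(generators, n):
    """Close the generators under composition, starting from the identity."""
    identity = tuple(range(n))
    group, frontier = {identity}, {identity}
    while frontier:
        new = set()
        for g in frontier:
            for s in generators:
                h = compose(s, g)
                if h not in group:
                    group.add(h)
                    new.add(h)
        frontier = new
    return group

# 0-indexed tuples of images for the cycles (1 2) and (1 2 3 4 5)
swap = (1, 0, 2, 3, 4)
cycle = (1, 2, 3, 4, 0)

G = generated_group([swap, cycle], 5)
print(len(G))  # 120, i.e. all of S5
```

Since |G| = 120 = 5!, the two generators indeed yield every permutation of five elements, which is why the two equations in E above suffice to express all variable permutations of f's arguments.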

# **5 Redundancy Criteria and Contraction Techniques**

**Definition 7** (Relative reducibility) Given an equational theory E, a ground instance Cσ₁ of C || φ₁ is reduced relative to a ground instance Dσ₂ of D || φ₂ if for any rewrite system R, Cσ₁ is a reduced ground instance of C || φ₁ w.r.t. R, E whenever Dσ₂ is a reduced ground instance of D || φ₂ w.r.t. R, E.

In what follows, the relation ⊴ on terms denotes the subterm relation, i.e., s ⊴ t if s is a subterm of t. The relation ⊴ on sets of terms is defined as follows: {s₁,...,sₘ} ⊴ {t₁,...,tₙ} if for all 1 ≤ i ≤ m there is some 1 ≤ j ≤ n such that sᵢ ⊴ tⱼ, and ∅ ⊴ X for any set of terms X. Given a clause C || φ, we denote by Ran(σ|Vars(C)) for some σ ∈ Sol(φ) the range of the restriction of σ to the set of variables Vars(C) if Vars(C) ≠ ∅. If C is a ground clause with a tautological constraint (e.g. the empty constraint), then we set Ran(σ|Vars(C)) = ∅. (Note that any ground substitution is a solution of a tautological constraint.)
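The subterm relation on sets of terms just defined can be spelled out directly. The following Python sketch (term encoding and names are ours, for illustration only) checks whether every term of the first set is a subterm of some term of the second set, with the empty set below any set:

```python
def subterms(t):
    """Enumerate all subterms of t, where t is a constant string
    or a tuple (symbol, arg1, ..., argn)."""
    yield t
    if isinstance(t, tuple):
        for a in t[1:]:
            yield from subterms(a)

def is_subterm(s, t):
    """s ⊴ t: s occurs as a subterm of t (including t itself)."""
    return any(s == u for u in subterms(t))

def set_subterm(S, T):
    """{s1,...,sm} ⊴ {t1,...,tn}: each si is a subterm of some tj;
    vacuously true for S = ∅."""
    return all(any(is_subterm(s, t) for t in T) for s in S)

print(set_subterm({"a"}, {("f", "a")}))  # True: a is a subterm of f(a)
print(set_subterm(set(), {"a"}))          # True: ∅ is below any set
```

This is the relation used in Lemma 8 to compare Ran(σ₁|Vars(C)) with Ran(σ₂|Vars(D)).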

We say that a clause C || φ is a clause with a succedent top variable [21] w.r.t. σ ∈ Sol(φ) if there is a variable x ∈ Vars(C) ∩ Vars(φ) appearing only in equations x ≈ t of the succedent of C with xσ ≻ tσ for some t. The following lemma, which directly follows from Definition 7, gives a sufficient syntactic condition for Cσ₁ being reduced relative to Dσ₂ in Definition 7 if D || φ₂ is not a clause with a succedent top variable w.r.t. σ₂. If D || φ₂ is a clause with a succedent top variable x w.r.t. some σ₂ ∈ Sol(φ₂), then one may (partially) instantiate x in D with σ₂ if possible, so that one may use the syntactic condition for checking whether Cσ₁ is reduced relative to Dσ₂ as in the following lemma.

**Lemma 8** Given an equational theory E, a ground instance Cσ₁ of C || φ₁ is reduced relative to a ground instance Dσ₂ of D || φ₂ if Ran(σ₁|Vars(C)) ⊴ Ran(σ₂|Vars(D)) and D || φ₂ is not a clause with a succedent top variable w.r.t. σ₂.

In what follows, we denote by E^{≺C} (resp. R^{≺C}) the set of ground instances of equations in E (resp. the set of ground rewrite rules in R) smaller than the ground clause C (w.r.t. ≻), and by S modulo E a set of clauses S with a built-in equational theory E.

**Definition 9** (Redundancy) A clause C || φ is redundant in S modulo E (w.r.t. relative reducibility) if for every ground instance Cσ, there exist ground instances C₁σ₁,...,Cₖσₖ of clauses C₁ || φ₁,...,Cₖ || φₖ in S reduced relative to Cσ, such that Cσ ≻ Cᵢσᵢ, 1 ≤ i ≤ k, and {C₁σ₁,...,Cₖσₖ} ∪ R^{≺Cσ} ∪ E^{≺Cσ} |= Cσ for any ground rewrite system R contained in ≻. (In this case, we also say that each Cσ is redundant in S modulo E (w.r.t. relative reducibility).)

**Definition 10** (Basic E-simplification) An equation l ≈ r simplifies a clause C ∨ L[lρ]\_p || φ into C ∨ L[rρ]\_p || φ if the following conditions are met:


**Lemma 11** If an equation l ≈ r simplifies a clause C ∨ L[lρ]\_p || φ into C ∨ L[rρ]\_p || φ as in Definition 10, then C ∨ L[lρ]\_p || φ is redundant in S modulo E, where S = {l ≈ r, C ∨ L[rρ]\_p || φ}.

The following definition extends the blocking rule in the free case (see [9]) to the modulo case, where a blocked clause does not contribute to finding a refutation during a theorem proving derivation w.r.t. BP (see Definition 16) starting with an initial set of unconstrained clauses.

**Definition 12** (Basic E-blocking) A clause C || φ is blocked in S modulo E if the following conditions are met:


**Definition 13** (Basic E-instance) A clause C || φ is a basic E-instance in S modulo E if the following conditions are met:


Observe that protected clauses are produced in a restricted way (e.g. see condition 5 in the E-Parallel rule) and if two protected clauses are the same up to variable renaming, then they are basic E-instances of each other and they do not need to be distinguished.

**Definition 14** (Redundancy of an inference) An inference π with conclusion D || φ is redundant in S modulo E (w.r.t. relative reducibility) if D || φ is blocked or a basic E-instance in S modulo E, or if for every ground instance πσ with maximal premise C and conclusion Dσ, there exist ground instances C₁σ₁,...,Cₖσₖ of clauses C₁ || φ₁,...,Cₖ || φₖ in S reduced relative to Dσ, such that C ≻ Cᵢσᵢ, 1 ≤ i ≤ k, and {C₁σ₁,...,Cₖσₖ} ∪ R^{≺C} ∪ E^{≺C} |= Dσ for any ground rewrite system R contained in ≻.

The following lemma immediately follows from Definition 9 and the observation that if {C₁σ₁,...,Cₖσₖ} ∪ E^{≺Cσ} |= Cσ, then {C₁σ₁,...,Cₖσₖ} ∪ R^{≺Cσ} ∪ E^{≺Cσ} |= Cσ for any ground rewrite system R contained in ≻; it serves as a sufficient condition for redundancy of clauses. Also, if an (unconstrained) clause C properly subsumes an (unconstrained) clause C′ ∨ D in the classical sense, where C and C′ are the same up to variable renaming, then it is easy to see that C′ ∨ D is redundant in {C} modulo E.

**Lemma 15** A clause C || φ is redundant in S modulo E if for every ground instance Cσ, there exist ground instances C₁σ₁,...,Cₖσₖ of clauses C₁ || φ₁,...,Cₖ || φₖ in S reduced relative to Cσ, such that Cσ ≻ Cᵢσᵢ, 1 ≤ i ≤ k, and {C₁σ₁,...,Cₖσₖ} ∪ E^{≺Cσ} |= Cσ.

**Definition 16** (Theorem proving derivation) A theorem proving derivation is a sequence of sets of clauses S₀ = S, S₁,... such that:

(i) Deduction: Sᵢ = Sᵢ₋₁ ∪ {C || φ} for some C || φ if it can be deduced from premises in Sᵢ₋₁ by applying an inference rule in BP or basic E-simplification.

(ii) Deletion: Sᵢ = Sᵢ₋₁ \ {D || ψ} for some D || ψ if it is not protected, and is redundant or blocked in Sᵢ₋₁ modulo E.

The set S∞ of persistent clauses is defined as S∞ = ⋃ᵢ ⋂ⱼ≥ᵢ Sⱼ, which is called the limit of the derivation. A theorem proving derivation S₀, S₁, S₂,... is fair [6] w.r.t. the inference system BP if every inference π by BP with premises in S∞ is redundant in ⋃ⱼ Sⱼ modulo E.

**Definition 17** (Saturation w.r.t. relative reducibility) Given an equational theory E, we say that S modulo E is saturated under BP w.r.t. relative reducibility if every inference by BP with premises in S is redundant in S modulo E.

In what follows, we say that a clause C || φ is non-protected redundant (resp. non-protected blocked) in S modulo E if it is not protected and is redundant (resp. blocked) in S modulo E. (If C || φ is non-protected redundant in S modulo E, then we also say that each ground instance Cσ of C || φ is non-protected redundant in S modulo E.)

**Lemma 18** (i) If S ⊆ S′, then any clause which is non-protected redundant or non-protected blocked in S modulo E is also non-protected redundant or non-protected blocked in S′ modulo E.

(ii) Let S ⊆ S′ such that all clauses in S′ \ S are non-protected redundant or non-protected blocked in S′ modulo E. Then (ii.1) any clause which is non-protected redundant or non-protected blocked in S′ modulo E is also non-protected redundant or non-protected blocked in S modulo E, and (ii.2) any inference which is redundant in S′ modulo E is also redundant in S modulo E.

**Lemma 19** Let S₀, S₁,... be a fair theorem proving derivation w.r.t. BP such that S₀ is a set of unconstrained clauses. Then S∞ modulo E is saturated under BP w.r.t. relative reducibility.

Proof. If S∞ contains the empty clause, then it is immediate that S∞ modulo E is saturated under BP w.r.t. relative reducibility, so we assume that the empty clause is not in S∞.

If a clause C || φ is deleted in a theorem proving derivation, then we see that it is non-protected redundant or non-protected blocked in some Sⱼ modulo E. It is also non-protected redundant or non-protected blocked in ⋃ⱼ Sⱼ modulo E by Lemma 18(i). Similarly, every clause in ⋃ⱼ Sⱼ \ S∞ is non-protected redundant or non-protected blocked in ⋃ⱼ Sⱼ modulo E.

Now by fairness of the derivation, every inference π by BP with premises in S∞ is redundant in ⋃ⱼ Sⱼ modulo E. Then by Lemma 18(ii.2) and the above, π is also redundant in S∞ modulo E. Thus, S∞ modulo E is saturated under BP w.r.t. relative reducibility.

# **6 Refutational Completeness**

The soundness of BP (w.r.t. a fair theorem proving derivation) is straightforward, i.e., Sᵢ ∪ E |= Sᵢ₊₁ ∪ E for all i ≥ 0. If the empty clause is in some Sⱼ, then S₀ ∪ E is unsatisfiable by the soundness of BP. The following theorem states that BP with our contraction rules (i.e. basic E-simplification and basic E-blocking) is refutationally complete. In order to prove it, we adapt a variant of model construction techniques [7–9, 21, 27]. In this section, we assume that equality is the only predicate, by expressing other predicates (i.e. predicate terms) as (predicate) equations as discussed in Section 4.

**Theorem 20** Let S₀, S₁,... be a fair theorem proving derivation w.r.t. BP such that S₀ is a set of unconstrained clauses. Then S₀ ∪ E is unsatisfiable if and only if the empty clause is in some Sⱼ.

**Definition 21** (Model construction) Let S be a set of (constrained) clauses. We use induction on ≻ to define the sets Rules_C, R_C, E_C, and I_C for all ground instances C of clauses in S. Let C be such a ground instance of a clause in S and suppose that Rules_{C′} has been defined for all ground instances C′ of clauses in S with C ≻ C′. Then we define R_C = ⋃_{C ≻ C′} Rules_{C′}, and E_C as the set of ground instances e₁ ≈ e₂ of equations in E such that C ≻ e₁ ≈ e₂ and e₁ and e₂ are both irreducible by R_C. We also define I_C as the interpretation (R_C ∪ E_C)* (i.e. the least congruence containing R_C ∪ E_C).

Now let C := D ∨ s ≈ t be a reduced ground instance of a clause in S w.r.t. R_C such that C is not an instance of a clause with a selected literal. Then C produces the set of ground rewrite rules Rules_C = {u ⇒ t | u ≈\_E s and u is irreducible by R_C} if the following conditions are met: (1) I_C ⊭ C (resp. I_C ⊭ D) if C is an instance of a non-protected clause (resp. protected clause), (2) I_C ⊭ t′ ≈ t for every s′ ≈ t′ in D with s′ ≈\_E s, (3) s ≈ t is reductive for C, and (4) there exists u with u ≈\_E s for which u is irreducible by R_C. We say that C is productive and produces Rules_C if it satisfies all of the above conditions. Otherwise, Rules_C = ∅. Finally, we define R_S = ⋃_C R_C, E_S = ⋃_C E_C, and I_S = (R_S ∪ E_S)*.

We may include the special non-productive ground clause tt ≈ tt in S for the above (inductive) definition, where tt ≈ tt is assumed to be greater w.r.t. ≻ than all ground instances of clauses in S ∪ E other than tt ≈ tt itself (see [21, 27]). (If C is the strictly maximal ground instance among ground instances of clauses in S and is productive, then R_S may not include Rules_C by the above inductive definition of R_C without tt ≈ tt.) In what follows, we say that a ground instance πσ of an inference π with premises in S is reduced if each premise and the conclusion of πσ is a reduced ground instance of a clause in S ∪ E w.r.t. R_S, E_S.

**Definition 22** (Redundancy w.r.t. R_S, E_S) A clause C || φ is redundant in S modulo E w.r.t. R_S, E_S if for every reduced ground instance Cσ w.r.t. R_S, E_S, there exist reduced ground instances C₁σ₁,...,Cₖσₖ of clauses C₁ || φ₁,...,Cₖ || φₖ in S w.r.t. R_S, E_S, such that Cσ ≻ Cᵢσᵢ, 1 ≤ i ≤ k, and {C₁σ₁,...,Cₖσₖ} ∪ R_S^{≺Cσ} ∪ E^{≺Cσ} |= Cσ. (In this case, we also say that each Cσ is redundant in S modulo E w.r.t. R_S, E_S.)

An inference π with conclusion D || φ is redundant in S modulo E w.r.t. R_S, E_S if D || φ is blocked or a basic E-instance in S modulo E, or if for every reduced ground instance πσ with maximal premise C and conclusion Dσ, there exist reduced ground instances C₁σ₁,...,Cₖσₖ of clauses C₁ || φ₁,...,Cₖ || φₖ in S w.r.t. R_S, E_S, such that C ≻ Cᵢσᵢ, 1 ≤ i ≤ k, and {C₁σ₁,...,Cₖσₖ} ∪ R_S^{≺C} ∪ E^{≺C} |= Dσ.

**Definition 23** (Saturation w.r.t. R_S, E_S) Given an equational theory E, we say that S modulo E is saturated under BP w.r.t. R_S, E_S if every inference by BP with premises in S is redundant in S modulo E w.r.t. R_S, E_S.

**Lemma 24** (i) There are no overlaps among the left-hand sides of rules in R_S.

(ii) A term t is reducible by R_S if and only if it is reducible by R_S, E_S at the same position.

(iii) For every l ⇒ r, s ⇒ t ∈ R_S, if l ≈\_E s, then r and t are the same term.

(iv) R_S/E_S is terminating.

(v) For ground terms u and v, if I_S |= u ≈ v, then u ↓_{R_S,E_S} v.

(vi) If a ground instance Cθ := Dθ ∨ lθ ≈ rθ of a clause C || φ := D ∨ l ≈ r || φ is productive, then it is a reduced ground instance of C || φ w.r.t. R_S, E_S.

The proofs of (i), (ii), and (iii) in Lemma 24 follow from the construction of R_S in Definition 21. For (iv), since R_S is contained in an E-compatible reduction ordering on terms that is E-total on ground terms, R_S/E_S is terminating. Meanwhile, Lemma 24(v) describes the ground Church-Rosser property [19] of R_S, E_S. Since R_S/E_S is terminating by (iv), this shows that R_S, E_S is ground convergent modulo E_S. In the following, we assume that any clause set saturated under BP is obtained from an initial set of clauses without constraints.

**Lemma 25** Let S modulo E be saturated under BP w.r.t. R_S, E_S, not containing the empty clause, and let C be a reduced ground instance of a clause in S w.r.t. R_S, E_S or a ground instance of an equation in E. Then C is true in I_S. More specifically,

(i) C is not an instance of a blocked clause in S modulo E.

(ii) If C is redundant in S modulo E w.r.t. R_S, E_S, then it is true in I_S.

(iii) If C is an instance of a clause with a selected literal, then it is true in I_S.

(iv) If C contains a maximal negative literal (w.r.t. ≻) and is not an instance of a clause with a selected literal, then it is true in I_S.

(v) If C is an instance of an equation in E, then it is true in I_S.

(vi) If C is an instance of a protected clause or a basic E-instance of it, then it is true in I_S.

(vii) If C is non-productive, then it is true in I_S.

(viii) If C := C′ ∨ s ≈ t is productive and produces Rules_C with s ⇒ t ∈ Rules_C, then C′ is false and C is true in I_S.

We leave it to the reader to verify the following lemma using the definitions of redundancy of an inference w.r.t. relative reducibility and w.r.t. RS, ES, along with Lemma 19.

**Lemma 26** Let S₀, S₁,... be a fair theorem proving derivation w.r.t. BP such that S₀ is a set of unconstrained clauses. Then S∞ modulo E is saturated under BP w.r.t. R_S∞, E_S∞.

**Theorem 27** Let S₀, S₁,... be a fair theorem proving derivation w.r.t. BP such that S₀ is a set of unconstrained clauses. If S∞ does not contain the empty clause, then I_S∞ |= S₀ ∪ E (i.e., S₀ ∪ E is satisfiable).

Proof. By Lemma 26, we know that S∞ modulo E is saturated under BP w.r.t. R_S∞, E_S∞. Let C be a ground instance of an equation in E or a ground instance of a clause C′ in S₀. By Lemma 25(v), if C is a ground instance of an equation in E, then it is true in I_S∞. Therefore, we assume that C is not a ground instance of an equation in E. Suppose first that C := C′σ is a reduced ground instance of C′ ∈ S₀ w.r.t. R_S∞, E_S∞. Then there are two cases to consider. If C′ ∈ S∞, then C is true in I_S∞ by Lemma 25. Otherwise, if C′ ∉ S∞, then C′ is (non-protected) redundant in some Sⱼ modulo E w.r.t. relative reducibility, because C′ ∈ S₀ (with the empty constraint) is neither protected nor can it be a blocked clause in some Sⱼ modulo E. Thus, C′ is (non-protected) redundant in ⋃ⱼ Sⱼ modulo E w.r.t. relative reducibility, and hence is (non-protected) redundant in S∞ modulo E w.r.t. relative reducibility by Lemma 18. It follows that there exist ground instances C₁σ₁,...,Cₖσₖ of clauses C₁ || φ₁,...,Cₖ || φₖ in S∞ reduced relative to C, such that C ≻ Cᵢσᵢ, 1 ≤ i ≤ k, and {C₁σ₁,...,Cₖσₖ} ∪ R^{≺C} ∪ E^{≺C} |= C for any ground rewrite system R contained in ≻. Since C is a reduced ground instance of C′ w.r.t. R_S∞, E_S∞, we see that the Cᵢσᵢ, 1 ≤ i ≤ k, are also reduced ground instances w.r.t. R_S∞, E_S∞ by Definition 7 and are true in I_S∞ by Lemma 25. Similarly, R_S∞^{≺C} and E^{≺C} are true in I_S∞ by Lemma 25, and hence we may infer that C is also true in I_S∞.

Now suppose that C := C′σ is a reducible ground instance of C′ ∈ S₀ w.r.t. R_S∞, E_S∞. Let σ′ be a ground substitution such that xσ′ = xσ↓_{R_S∞,E_S∞} for each x ∈ Vars(C′). Since C′σ′ is a reduced ground instance of C′ ∈ S₀ w.r.t. R_S∞, E_S∞, C′σ′ is true in I_S∞ by the previous paragraph, and hence C is also true in I_S∞.

We may now present the proof that BP with our contraction rules is refutationally complete.

**Proof of Theorem 20** Let S₀, S₁,... be a fair theorem proving derivation w.r.t. BP such that S₀ is a set of unconstrained clauses. If the empty clause is in some Sⱼ, then S₀ ∪ E is unsatisfiable by the soundness of BP. Otherwise, if the empty clause is not in Sₖ for any k, then S∞ does not contain the empty clause, and hence S₀ ∪ E is satisfiable by Theorem 27.

## **7 Conclusion**

We have presented a basic paramodulation calculus modulo E and provided a framework for equational theorem proving modulo equational theories E satisfying certain properties, using constrained clauses, where a constrained clause may schematize a set of unconstrained clauses by keeping E-unification problems in its constraint part. Our results imply that we can deal uniformly with different equational theories E in our equational theorem proving modulo framework: we only need a single refutational completeness proof for our basic paramodulation calculus modulo E for different equational theories E.

Our contraction techniques (i.e. basic E-simplification and basic E-blocking) for constrained clauses can also be applied uniformly for different equational theories E satisfying certain properties in our equational theorem proving modulo framework. Since a constrained clause may schematize a set of unconstrained clauses, the simplification or deletion of a constrained clause may correspond to the simplification or deletion of a set of unconstrained clauses. We have proposed a saturation procedure for constrained clauses based on relative reducibility and shown the refutational completeness of our inference system using a saturated clause set (w.r.t. ≻).

Some possible improvements remain. One of the main issues is broadening the scope of our equational theorem proving modulo E to more equational theories E. This can be achieved by dropping or weakening some ordering requirements on ≻ (e.g. the monotonicity of ≻) for a basic paramodulation calculus modulo E, while maintaining the refutational completeness of the calculus (cf. [10]). It can also be achieved by finding suitable E-compatible orderings for more equational theories E. In fact, we provided an E-compatible simplification ordering on terms that is E-total on ground terms for finite permutation theories E in [17], which allows us to obtain a refutationally complete equational theorem proving procedure with built-in permutation theories using the results of this paper. Since permutations play an important role in mathematics and many fields of science, including computer science, we believe that developing applications for equational theorem proving with built-in permutation theories is another promising future research direction.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Unifying Decidable Entailments in Separation Logic with Inductive Definitions

Mnacho Echenim<sup>1</sup>, Radu Iosif<sup>2</sup>, and Nicolas Peltier<sup>1</sup>

<sup>1</sup> Univ. Grenoble Alpes, CNRS, LIG, F-38000 Grenoble, France
<sup>2</sup> Univ. Grenoble Alpes, CNRS, VERIMAG, F-38000 Grenoble, France

Abstract. The entailment problem ϕ |= ψ in Separation Logic [12,15], between separated conjunctions of equational (*x* ≈ *y* and *x* ≉ *y*), spatial (*x* → (*y*1,..., *y*κ)) and predicate (*p*(*x*1,..., *xn*)) atoms, interpreted by a finite set of inductive rules, is undecidable in general. Certain restrictions on the set of inductive definitions lead to decidable classes of entailment problems. Currently, there are two such decidable classes, based on two restrictions, called *establishment* [10,13,14] and *restrictedness* [8], respectively. Both classes are shown to be in 2EXPTIME by the independent proofs from [14] and [8], respectively, and a many-one reduction of established to restricted entailment problems has been given [8]. In this paper, we strictly generalize the restricted class, by distinguishing the conditions that apply only to the left- (ϕ) and the right- (ψ) hand side of entailments, respectively. We provide a many-one reduction of this generalized class, called *safe*, to the established class. Together with the reduction of established to restricted entailment problems, this new reduction closes the loop and shows that the three classes of entailment problems (respectively established, restricted and safe) form a single, unified, 2EXPTIME-complete class.

## 1 Introduction

Separation Logic [12,15] (SL) was primarily introduced for writing concise Hoare logic proofs of programs that handle pointer-linked recursive data structures (lists, trees, etc.). Over time, SL has evolved into a powerful logical framework that constitutes the basis of several industrial-scale static program analyzers [3,2,5], which perform scalable compositional analyses based on the principle of *local reasoning*: describing the behavior of a program statement with respect only to the small (local) set of memory locations that are changed by that statement, with no concern for the rest of the program's state.

Given a set of memory locations (e.g., addresses), SL formulæ describe *heaps*: finite partial functions mapping finitely many locations to records of locations. A location is *allocated* if it occurs in the domain of the heap. An atom *x* → (*y*1,..., *y*κ) states that there is exactly one allocated location, the one associated with *x*, which moreover refers to the tuple of locations associated with (*y*1,..., *y*κ), respectively. The *separating conjunction* φ ∗ ψ states that the heap can be split into two parts, with disjoint domains, that make φ and ψ true, respectively. The separating conjunction is instrumental in supporting local reasoning, because the disjointness between the (domains of the) models of its arguments ensures that no update of one heap can affect the other.

Reasoning about recursive data structures of unbounded sizes (lists, trees, etc.) is possible via the use of predicate symbols, whose interpretation is specified by a user-provided *set of inductive definitions* (SID) of the form *p*(*x*1,..., *xn*) ⇐ π, where *p* is a predicate symbol of arity *n* and the free variables of the formula π are among the parameters *x*1,..., *xn* of the rule. Here the separating conjunction ensures that each unfolding of the rules, which substitutes some predicate atom *p*(*y*1,..., *yn*) by a formula π[*x*1/*y*1,..., *xn*/*yn*], corresponds to a way of building the recursive data structure. For instance, a list is either empty, in which case its head equals its tail pointer, or is built by first allocating the head, followed by all elements up to but not including the tail, as stated by the inductive definitions ls(*x*, *y*) ⇐ *x* ≈ *y* and ls(*x*, *y*) ⇐ ∃*z* . *x* → (*z*) ∗ ls(*z*, *y*).
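To make the unfolding semantics concrete, here is a small illustrative sketch (not part of the paper's development): it checks whether a finite heap, represented as a dict from locations to their single record field (κ = 1), is a model of ls(*x*, *y*) under the two rules above, assuming the precise semantics in which pure atoms hold only in the empty heap.

```python
def models_ls(h, lx, ly):
    """Check (s,h) |= ls(x,y), where lx = s(x) and ly = s(y), for the SID
       ls(x,y) <= x ~ y   and   ls(x,y) <= Ez . x -> (z) * ls(z,y)."""
    # Base rule: the pure atom x ~ y holds in the empty heap only.
    if lx == ly and h == {}:
        return True
    # Inductive rule: allocate the head cell; the disjoint rest of the
    # heap must then be a model of ls(z,y) for the successor location z.
    if lx in h:
        rest = {l: v for l, v in h.items() if l != lx}
        if models_ls(rest, h[lx], ly):
            return True
    return False
```

For instance, `models_ls({1: 2, 2: 3}, 1, 3)` holds, reflecting two applications of the inductive rule followed by the base rule; note that a one-cell cycle such as `{1: 1}` is also a model of ls(*x*, *x*).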

An important problem in program verification, arising during the construction of Hoare-style correctness proofs of programs, is the discharge of verification conditions of the form φ |= ψ, where φ and ψ are SL formulæ, asking whether every model of φ is also a model of ψ. These problems, called *entailments*, are, in general, undecidable in the presence of inductively defined predicates [11,1].

A first decidable class of entailments, described in [10], involves three restrictions on the SID rules: *progress*, *connectivity* and *establishment*. Intuitively, the progress (P) condition states that every rule allocates exactly one location, the connectivity (C) condition states that the set of allocated locations has a tree-shaped structure, and the establishment (E) condition states that every existentially quantified variable from a rule defining a predicate is (eventually) allocated in every unfolding of that predicate. A 2EXPTIME algorithm was proposed for testing the validity of PCE entailments [13,14] and a matching 2EXPTIME-hardness lower bound was provided shortly after [6].

Later work relaxes the establishment condition, necessary for decidability [7], by proving that the entailment problem is still in 2EXPTIME if the establishment condition is replaced by the *restrictedness* (R) condition, which requires that every disequality (*x* ≉ *y*) involves at least one free variable from the left-hand side of the entailment, propagated through the unfoldings of the inductive system [8]. Interestingly, the rules of a progressive, connected and restricted (PCR) entailment may generate data structures with "dangling" (i.e., existentially quantified but not allocated) pointers, which was not possible with PCE entailments.

In this paper, we generalize PCR entailments further, by showing that the connectivity and restrictedness conditions are needed only on the right-hand side of the entailment, whereas the only condition required on the left-hand side is progress (which can usually be enforced by folding or unfolding definitions). Our results thus allow for "asymmetric" entailments, i.e., one can test whether the structures described by (almost) arbitrary inductive rules fulfill some restricted formula. Although the class of data structures that can be described is much larger, we show that this new class of entailments, called *safe*, is also 2EXPTIME-complete, by a many-one reduction of the validity of safe entailments to the validity of PCE entailments. A second contribution of the paper is the cross-certification of the two independent proofs of the 2EXPTIME upper bounds, for the PCE [6,14,8] and PCR [8] classes of entailments, respectively, by closing the loop. Namely, the reduction given in this paper enables the translation of any of the three entailment problems into an equivalent problem in any other class, while preserving the 2EXPTIME upper bound. This is because all the reductions are polynomial in the overall size of the SID and singly exponential in the maximum size of the rules in the SID. The theoretical interest of the reduction is that it makes the proof of decidability and of the complexity class much shorter and clearer. It also has practical advantages, since it allows one to re-use existing implementations designed for established systems, instead of having to develop entirely new automated reasoning systems. Due to space restrictions, some of the proofs are omitted. All proofs can be found in [9].

## 2 Definitions

For a (partial) function *f* : *A* → *B*, we denote by dom(*f*) and rng(*f*) its domain and range, respectively. For a relation *R* ⊆ *A*×*A*, we denote by *R*<sup>∗</sup> the reflexive and transitive closure of *R*.

Let κ be a fixed natural number throughout this paper and let P be a countably infinite set of *predicate symbols*. Each predicate symbol *p* ∈ P is associated with a unique arity, denoted *ar*(*p*). Let V be a countably infinite set of *variables*. For technical convenience, we also consider a special constant ⊥, which will be used to denote "empty" record fields. Formulæ are built inductively, according to the following syntax:

$$\varphi := x \approx x' \mid x \not\approx x' \mid x \mapsto (y_1, \dots, y_\kappa) \mid p(x_1, \dots, x_n) \mid \varphi_1 * \varphi_2 \mid \varphi_1 \vee \varphi_2 \mid \exists x \,.\, \varphi_1$$

where *p* ∈ P is a predicate symbol of arity *n* = *ar*(*p*), *x*, *x*′, *x*1,..., *xn* ∈ V are variables and *y*1,..., *y*κ ∈ V ∪ {⊥} are *terms*, i.e., either variables or ⊥.

The set of variables freely occurring in a formula φ is denoted by fv(φ). We assume by α-equivalence that the same variable cannot occur both free and bound in the same formula φ, and that distinct quantifiers bind distinct variables. The *size* |φ| of a formula φ is the number of occurrences of symbols in φ. A formula *x* ≈ *x*′ or *x* ≉ *x*′ is an *equational atom*, *x* → (*y*1,..., *y*κ) is a *points-to atom*, whereas *p*(*x*1,..., *xn*) is a *predicate atom*. Note that ⊥ cannot occur in an equational or in a predicate atom. A formula is *predicate-less* if no predicate atom occurs in it. A *symbolic heap* is a formula of the form ∃*x* . α1 ∗ ··· ∗ α*m*, where each α*i* is an atom and *x* is a possibly empty vector of variables.

Definition 1. *A variable x is* allocated by a symbolic heap φ *iff* φ *contains a sequence of equalities x*1 ≈ *x*2 ≈ ... ≈ *xn*−1 ≈ *xn, for n* ≥ 1*, such that x* = *x*1 *and xn* → (*y*1,..., *y*κ) *occurs in* φ*, for some variables x*1,..., *xn and some terms y*1,..., *y*κ ∈ V ∪ {⊥}*.*
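As an illustrative sketch (ours, with an assumed encoding of atoms), the allocation check of Definition 1 amounts to closing the equalities of φ into an equivalence class and testing whether that class contains the source variable of a points-to atom:

```python
def allocated(x, equalities, pointsto_sources):
    """Definition 1: x is allocated iff a chain of equalities of the
    symbolic heap links x to the source variable of a points-to atom.
    equalities: iterable of pairs (u, v); pointsto_sources: set of vars."""
    reach = {x}                      # equivalence class of x, grown below
    changed = True
    while changed:
        changed = False
        for u, v in equalities:
            if u in reach and v not in reach:
                reach.add(v); changed = True
            elif v in reach and u not in reach:
                reach.add(u); changed = True
    return any(v in reach for v in pointsto_sources)
```

The degenerate chain *n* = 1 means that *x* itself is the source of a points-to atom, so `allocated('x', [], {'x'})` holds.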

A *substitution* is a partial function mapping variables to variables. If σ is a substitution and φ is a formula, a variable or a tuple, then φσ denotes the formula, the variable or the tuple obtained from φ by replacing every free occurrence of a variable *x* ∈ dom(σ) by σ(*x*). We denote by {*x*1 ← *y*1,..., *xn* ← *yn*} the substitution with domain {*x*1,..., *xn*} that maps *xi* to *yi*, for each *i* ∈ ⟦1,*n*⟧.

A *set of inductive definitions* (SID) *R* is a finite set of implications (or rules) of the form *p*(*x*1,..., *xn*) ⇐ π, where *p* ∈ P, *n* = *ar*(*p*), *x*1,..., *xn* are pairwise distinct variables and π is a quantifier-free symbolic heap. The predicate atom *p*(*x*1,..., *xn*) is the *head* of the rule and *R*(*p*) denotes the subset of *R* consisting of rules with head *p*(*x*1,..., *xn*) (the choice of *x*1,..., *xn* is not important). The variables in fv(π) \ {*x*1,..., *xn*} are called the *existential variables of the rule*. Note that, by definition, these variables are not explicitly quantified inside π and that π is quantifier-free. For simplicity, we denote by *p*(*x*1,..., *xn*) ⇐*R* π the fact that the rule *p*(*x*1,..., *xn*) ⇐ π belongs to *R*. The *size* of *R* is defined as |*R*| def= ∑ (|π| + *n*), where the sum ranges over all rules *p*(*x*1,..., *xn*) ⇐*R* π, and its *width* as w(*R*) def= max (|π| + *n*), where the maximum ranges over the same set of rules.

We write *p* →*R* *q*, for *p*, *q* ∈ P, iff *R* contains a rule of the form *p*(*x*1,..., *xn*) ⇐ π and *q* occurs in π. We say that *p depends on q* if *p* →*R*<sup>∗</sup> *q*. For a formula φ, we denote by *P*(φ) the set of predicate symbols *q* such that *p* →*R*<sup>∗</sup> *q* for some predicate *p* occurring in φ.

Given formulæ φ and ψ, we write φ ⇐*R* ψ if ψ is obtained from φ by replacing an atom *p*(*u*1,...,*un*) by π{*x*1 ← *u*1,..., *xn* ← *un*}, where *R* contains a rule *p*(*x*1,..., *xn*) ⇐ π. We assume, by a renaming of existential variables, that the set (fv(π) \ {*x*1,..., *xn*}) ∩ fv(φ) is empty. We call ψ an *unfolding* of φ iff φ ⇐*R*<sup>∗</sup> ψ.

We now define the semantics of SL. Let *L* be a countably infinite set of *locations* containing, in particular, a special location ℓ⊥ that interprets the constant ⊥. A *structure* is a pair (s,h), where:

- s is a *store*, i.e., a partial function mapping variables and the constant ⊥ to locations, with s(⊥) = ℓ⊥;
- h : *L* ⇀ *L*<sup>κ</sup> is a *heap*, i.e., a finite partial function mapping locations to κ-tuples of locations.
If *x*1,..., *xn* are pairwise distinct variables and ℓ1,...,ℓ*n* ∈ *L* are locations, we denote by s[*xi* ← ℓ*i* | 1 ≤ *i* ≤ *n*] the store s′ defined by dom(s′) = dom(s) ∪ {*x*1,..., *xn*}, s′(*y*) = ℓ*i* if *y* = *xi* for some *i* ∈ ⟦1,*n*⟧, and s′(*y*) = s(*y*) otherwise. If *x*1,..., *xn* ∉ dom(s), then the store s′ is called an *extension* of s to {*x*1,..., *xn*}.

Given a heap h, we define ref(h) def= {ℓ*i* | h(ℓ) = (ℓ1,...,ℓκ) for some ℓ ∈ dom(h), *i* ∈ ⟦1,κ⟧} and loc(h) def= dom(h) ∪ ref(h). Two heaps h1 and h2 are *disjoint* iff dom(h1) ∩ dom(h2) = ∅, in which case h1 ⊎ h2 denotes the union of h1 and h2; this union is undefined whenever h1 and h2 are not disjoint.
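These heap operations can be sketched directly (an illustration of ours, with a heap as a dict from locations to κ-tuples):

```python
def ref(h):
    """Locations referenced by some record field of h."""
    return {l_i for tup in h.values() for l_i in tup}

def loc(h):
    """All locations of h: allocated ones plus referenced ones."""
    return set(h) | ref(h)

def disjoint_union(h1, h2):
    """h1 and h2 are disjoint iff their domains do not intersect;
    their union is undefined (here: None) otherwise."""
    if set(h1) & set(h2):
        return None
    return {**h1, **h2}
```
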

Given an SID *R* , (s,h) |=*<sup>R</sup>* φ is the least relation between structures and formulæ such that whenever (s,h) |=*<sup>R</sup>* φ, we have fv(φ) ⊆ dom(s) and the following hold:


Given formulæ φ and ψ, we write φ |=*R* ψ whenever (s,h) |=*R* φ ⇒ (s,h) |=*R* ψ holds for all structures (s,h), and φ ≡*R* ψ whenever both φ |=*R* ψ and ψ |=*R* φ. We omit the subscript *R* whenever these relations hold for any SID. It is easy to check that, for all formulæ φ1, φ2, ψ, we have (φ1 ∨ φ2) ∗ ψ ≡ (φ1 ∗ ψ) ∨ (φ2 ∗ ψ) and (∃*x*.φ1) ∗ φ2 ≡ ∃*x* . φ1 ∗ φ2, the latter provided *x* does not occur free in φ2 (which can always be ensured by α-renaming). Consequently, each formula can be transformed into an equivalent finite disjunction of symbolic heaps.

Definition 2. *An* entailment problem *is a triple* P def= φ ⊢*R* ψ*, where* φ *is a quantifier-free formula,* ψ *is a formula and R is an SID. The problem* P *is* valid *iff* φ |=*R* ψ*. The* size *of the problem* P *is defined as* |P| def= |φ| + |ψ| + |*R*| *and its* width *as* w(P) def= max(|φ|, |ψ|, w(*R*))*.*

Note that considering φ to be quantifier-free loses no generality, because ∃*x*.φ |=*<sup>R</sup>* ψ ⇐⇒ φ |=*<sup>R</sup>* ψ.

# 3 Decidable Entailment Problems

The class of general entailment problems is undecidable, see Theorem 5 below for a refinement of the initial undecidability proofs [11,1]. A first attempt to define a natural decidable class of entailment problems is described in [10] and involves three restrictions on the SID rules, formally defined below:

Definition 3. *A rule p*(*x*1,..., *xn*) ⇐ π *is:*


*An SID R is* P *(resp.* C*,* E*) for a formula* φ *iff every rule in* ⋃<sub>*p*∈*P*(φ)</sub> *R*(*p*) *is P (resp. C, E). An entailment problem* φ ⊢*R* ψ *is* left- *(resp.* right-*)* P *(resp.* C*,* E*) iff R is P (resp. C, E) for* φ *(resp.* ψ*). An entailment problem is* P *(resp.* C*,* E*) iff it is both left- and right-P (resp. C, E).*

The decidability of progressing, connected and left-established entailment problems is an immediate consequence of the result of [10]. Moreover, an analysis of the proof in [10] leads to an elementary recursive complexity upper bound, which has recently been tightened to 2EXPTIME-complete [14,8,6]. In the following, we refer to Table 1 for a recap of the complexity results for the entailment problem. The last line is the main result of the paper and corresponds to the most general (known) decidable class of entailment problems (Definition 8).

Table 1. Decidability and Complexity Results for the Entailment Problem (✓ means that the corresponding condition holds on both the left- and right-hand side of the entailment)


The following theorem is an easy consequence of previous results [6].

Theorem 4. *The progressing, connected and left-established entailment problem is* 2*EXPTIME-complete. Moreover, there exists a decision procedure that runs in time* $2^{2^{O(\mathrm{w}(P)^8 \cdot \log |P|)}}$ *for every instance* P *of this problem.*

A natural question arises in this context: which of the restrictions from the above theorem can be relaxed and what is the price, in terms of computational complexity, of relaxing (some of) them? In the light of Theorem 5 below, the connectivity restriction cannot be completely dropped. Further, if we drop the establishment condition, the problem becomes undecidable [7, Theorem 6], even if both the left/right progress and connectivity conditions apply.

Theorem 5. *The progressing, left-connected and established entailment problem is undecidable.*

The second decidable class of entailment problems [8] relaxes the connectivity condition and replaces establishment with a syntactic condition (checkable in polynomial time in the size of the SID), while remaining 2EXPTIME-complete. Informally, the definition forbids (dis)equations between existential variables in symbolic heaps or rules: the only allowed (dis)equations are of the form *x* ≈ *y* or *x* ≉ *y*, where *x* is a free variable (viewed as a constant in [8]). The definition given below is essentially equivalent to that of [8], but avoids any reference to constants; instead it uses a notion of *R*-positional functions, which helps to identify existential variables that are always replaced by a free variable from the initial formula during unfolding.

An *R-positional function* maps every *n*-ary predicate symbol *p* occurring in *R* to a subset of ⟦1,*n*⟧. Given an *R*-positional function λ and a formula φ, we denote by Vλ(φ) the set of variables *xi* such that φ contains a predicate atom *p*(*x*1,..., *xn*) with *i* ∈ λ(*p*). Note that Vλ is stable under substitutions, i.e., Vλ(φσ) = (Vλ(φ))σ, for each formula φ and each substitution σ.

Definition 6. *Let* ψ *be a formula and R be an SID. The* fv-profile *of the pair* (ψ,*R* ) *is the R -positional function* λ *such that the sets* λ(*p*)*, for p* ∈ P*, are the maximal sets satisfying the following conditions:*


*The fv-profile of* (ψ,*R*) *is denoted by* λ<sup>ψ</sup><sub>*R*</sub>*.*

Intuitively, given a predicate *p* ∈ P, the set λ<sup>ψ</sup><sub>*R*</sub>(*p*) denotes the formal parameters of *p* that, in every unfolding of ψ, will always be substituted by variables occurring freely in ψ. It is easy to check that λ<sup>ψ</sup><sub>*R*</sub> can be computed in polynomial time w.r.t. |ψ| + |*R*|, using a straightforward greatest fixpoint algorithm. The algorithm starts with a function mapping every predicate *p* of arity *n* to ⟦1,*n*⟧ and repeatedly removes elements from the sets λ(*p*) to ensure that the above conditions hold. In the worst case, we may eventually have λ(*p*) = ∅ for all predicate symbols *p*.
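A plausible sketch of this greatest-fixpoint computation follows (ours; the propagation conditions are an assumption reconstructed from the intuition above, not a quotation of Definition 6): a position stays in λ(*q*) only if every occurrence of *q*, at the top level of ψ or inside a rule body, receives at that position either a free variable of ψ or a formal parameter whose own position is still in λ of the rule's head. Positions are 0-indexed here.

```python
def fv_profile(psi_atoms, psi_free_vars, rules, arities):
    """psi_atoms: [(pred, args)] at the top level of psi;
    rules: [(head_pred, head_params, body_atoms)];
    arities: {pred: n}.  Returns the candidate fv-profile."""
    lam = {p: set(range(n)) for p, n in arities.items()}
    changed = True
    while changed:                                  # greatest fixpoint
        changed = False
        for q, args in psi_atoms:                   # occurrences in psi
            for i in list(lam[q]):
                if args[i] not in psi_free_vars:
                    lam[q].discard(i); changed = True
        for p, params, body in rules:               # occurrences in rule bodies
            for q, args in body:
                for i in list(lam[q]):
                    ok = (args[i] in params
                          and params.index(args[i]) in lam[p])
                    if not ok:
                        lam[q].discard(i); changed = True
    return lam
```

On ψ = ∃*x*. ls(*x*, *y*) with the ls rules of the introduction, this yields λ(ls) = {1} (the second parameter), matching the intuition that *y* is the only parameter always instantiated by a free variable of ψ.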

Definition 7. *Let* λ *be an R -positional function, and V be a set of variables. A formula* φ *is* λ-restricted *(*λ*-R) w.r.t. V iff the following hold:*

*1. for every disequation y* ≉ *z in* φ*, we have* {*y*,*z*} ∩ *V* ≠ ∅*, and*

*2.* Vλ(φ) ⊆ *V .*


*An SID R is* P *(resp.* λ-C*,* λ-R*) for a formula* φ *iff every rule in* ⋃<sub>*p*∈*P*(φ)</sub> *R*(*p*) *is P (resp.* λ*-C,* λ*-R).*

*An entailment problem* φ ⊢*R* ψ *is* left*- (resp.* right*-)* λ-C *(resp.* λ-R*) iff R is* λ*-C (*λ*-R) for* φ *(resp.* ψ*), where* λ *is taken to be* λ<sup>φ</sup><sub>*R*</sub> *(resp.* λ<sup>ψ</sup><sub>*R*</sub>*). An entailment problem is* λ-C *(*λ-R*) iff it is both left- and right-*λ*-C (*λ*-R).*

The class of progressing, λ-connected and λ-restricted entailment problems has been shown to be a generalization of the class of progressing, connected and left-established problems, because the latter can be reduced to the former by a many-one reduction [8, Theorem 13] that runs in time |P| · 2<sup>*O*(w(P)<sup>2</sup>)</sup> on input P (Figure 1) and preserves the problem's width asymptotically.

In the rest of this paper we close the loop by defining a syntactic extension of progressing, λ-connected and λ-restricted entailment problems and by showing that this extension can be reduced to the class of progressing, connected and left-established entailment problems by a many-one reduction. The new fragment is defined as follows:

Definition 8. *An entailment problem* φ ⊢*R* ψ *is* safe *if, for* λ def= λ<sup>ψ</sup><sub>*R*</sub>*, the following hold:*


Note that there is no condition on the formula φ, or on the rules defining the predicates occurring only in φ, other than the progress condition. The conditions in Definition 8 ensure that all the disequations occurring in any unfolding of ψ involve at least one variable that is free in φ. Further, the heaps of the models of ψ must be *forests*, i.e., unions of trees whose roots are associated with the first arguments of the predicate atoms in ψ or with free variables from φ.

A typical yet very simple example of such an entailment is the so-called "reversed list" problem, which consists in checking that any list segment revls(*z*,*y*), defined in the reverse direction (from the tail to the head), is a list segment ls(*x*, *y*) in the usual sense (defined inductively from head to tail). This corresponds to the entailment problem revls(*z*, *y*) ⊢*R* ∃*x*.ls(*x*, *y*), where *R* contains the following rules:

$$\begin{array}{ll} \mathrm{ls}(x,y) \Leftarrow x \mapsto (y) & \mathrm{revls}(z,y) \Leftarrow z \mapsto (y)\\ \mathrm{ls}(x,y) \Leftarrow x \mapsto (z) * \mathrm{ls}(z,y) & \mathrm{revls}(z,y) \Leftarrow z \mapsto (y) * \mathrm{revls}(u,z) \end{array}$$

This problem is considered challenging for proof search-based automated reasoning procedures (see, e.g., [4,16]). The antecedent does not fulfill the connectivity condition, but the consequent does, hence the entailment is safe. Similar, more complex examples can be defined; for instance, a list can be constructed by interleaving elements at odd or even positions. Another example is a data structure containing an unbounded number of acyclic lists (e.g., a list of acyclic lists). Such a data structure does not fulfill the restrictedness condition, since one needs to compare the pointers occurring along each list to the one at its end. Checking, for instance, that the concatenation of two lists of acyclic lists is again a list of (possibly cyclic) lists is a problem that fits into the safe class and can thus be effectively checked by our algorithm.
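As a quick sanity check (a bounded test of ours, not the paper's decision procedure), one can enumerate the unfoldings of revls(*z*, *y*) up to a given depth and verify that each resulting heap is a model of ∃*x*. ls(*x*, *y*) under the rules above (κ = 1, with integer locations):

```python
def models_ls(h, lx, ly):
    """(s,h) |= ls(x,y) for ls(x,y) <= x -> (y)
       and ls(x,y) <= Ez . x -> (z) * ls(z,y)."""
    if h == {lx: ly}:                     # base rule: a single cell
        return True
    if lx in h:                           # inductive rule
        rest = {l: v for l, v in h.items() if l != lx}
        return models_ls(rest, h[lx], ly)
    return False

def revls_unfoldings(max_depth, ly=0):
    """The model of revls(z,y) obtained after n unfoldings is the chain
       n -> n-1 -> ... -> 1 -> ly, where location 1 interprets z."""
    for n in range(1, max_depth + 1):
        yield {i: i - 1 for i in range(1, n + 1)}

def entailment_holds_up_to(depth=6):
    # exists x . ls(x,y): some location lx heads the whole chain
    return all(any(models_ls(h, lx, 0) for lx in h)
               for h in revls_unfoldings(depth))
```

Here the witness for the existential *x* is the head of the chain, i.e., the deepest existential variable introduced by the revls unfoldings.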

We refer the reader to Figure 1 for a general picture of the entailment problems considered so far and of the many-one reductions between them, where the reduction corresponding to the dashed arrow is the concern of the next section. Importantly, since all reductions are many-one, taking time polynomial in the size and exponential in the width of the input problem, while preserving its width asymptotically, the three classes from Figure 1 can be unified into a single (2EXPTIME-complete) class of entailments.

## 4 Reducing Safe to Established Entailments

In a model of a safe SID (Definition 8), the existential variables introduced by the replacement of predicate atoms with corresponding rule bodies are not required to be allocated. This is because safe SIDs are more liberal than established SIDs and allow heap structures with an unbounded number of dangling pointers. As observed in [8], checking the validity of an entailment (w.r.t. a restricted SID) can be done by considering only those structures in which the dangling pointers point to pairwise distinct locations. The main idea of the present reduction of safe to established entailment problems is that any such structure can be extended by allocating all dangling pointers separately and, moreover, the extended structures can be defined by an established SID.

In what follows, we fix an arbitrary instance P = φ ⊢*R* ψ of the safe entailment problem (Definition 8) and denote by λ def= λ<sup>ψ</sup><sub>*R*</sub> the fv-profile of (ψ,*R*) (Definition 6). Let *w* def= (*w*1,...,*w*ν) be the vector of free variables from φ and ψ, where the order of variables is not important, and assume w.l.o.g. that ν > 0. Let *P<sup>l</sup>* def= *P*(φ) and *P<sup>r</sup>* def= *P*(ψ) be the sets of predicate symbols on which the predicate symbols occurring in the left- and right-hand side of the entailment depend, respectively. We assume that φ and ψ contain no points-to atoms and that *P<sup>l</sup>* ∩ *P<sup>r</sup>* = ∅. Again, these assumptions lose no generality, because a points-to atom *u* → (*v*1,..., *v*κ) can be replaced by a predicate atom *p*(*u*, *v*1,..., *v*κ), where *p* is a fresh predicate symbol associated with the rule *p*(*x*,*y*1,..., *y*κ) ⇐ *x* → (*y*1,..., *y*κ). Moreover, the condition *P<sup>l</sup>* ∩ *P<sup>r</sup>* = ∅ may be enforced by considering two copies of each predicate, one for the left-hand side and one for the right-hand side. Finally, we assume that every rule contains exactly *μ* existential variables, for some fixed *μ* ∈ N; this condition can be enforced by adding dummy literals *x* ≈ *x* if needed.
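The points-to elimination step just described can be sketched as follows (an illustration of ours; the encoding of atoms as tagged tuples and the fresh predicate names are assumptions):

```python
def remove_pointsto(atoms, rules):
    """Replace each points-to atom ('pto', u, args) by a fresh predicate
    atom p(u, v1, ..., vk) and add the rule
    p(x, y1, ..., yk) <= x -> (y1, ..., yk), as described in the text."""
    out, fresh = [], 0
    for atom in atoms:
        if atom[0] == 'pto':
            _, u, args = atom
            p = f'pto_{fresh}'; fresh += 1
            params = ('x',) + tuple(f'y{i+1}' for i in range(len(args)))
            rules = rules + [(p, params, [('pto', 'x', params[1:])])]
            out.append(('pred', p, (u,) + tuple(args)))
        else:
            out.append(atom)
    return out, rules
```

Each fresh predicate unfolds in exactly one step back to the points-to atom it replaced, so the transformation preserves the models of the formula.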

We describe a reduction of P to an equivalent progressing, connected, and left-established entailment problem. The reduction extends heaps by adding ν+*μ* record fields. We shall therefore often consider heaps and points-to atoms having κ+ν+*μ* record fields, where the formal definitions are similar to those given previously. Usually such formulæ and heaps will be written with a prime. These additional record fields are used to ensure that the constructed system is connected, by adding all the existential variables of a given rule (as well as the variables *w*1,...,*w*ν) to the image of the location allocated by that rule. Furthermore, the left-establishment condition is enforced by adding predicates and rules in order to allocate all the locations that correspond to existential quantifiers and that are not already allocated, making such locations point to a dummy vector ⊥⃗ def= (⊥,...,⊥) of length κ+ν+*μ*, where ⊥ is the special constant denoting empty heap entries. To this aim, we shall use a predicate symbol ⊥ associated with the rule ⊥(*x*) ⇐ *x* → ⊥⃗. Note that allocating all these locations entails (by definition of the separating conjunction) that they are pairwise distinct, thus the addition of such predicates and rules reduces the number of satisfiable unfoldings. However, due to the restrictions on the use of disequations<sup>3</sup>, we shall see that this does not change the status of the entailment problem.

Definition 9. *For any total function* γ : *L* → *L and any tuple* ℓ⃗ = (ℓ1,...,ℓ*n*) ∈ *L<sup>n</sup>, we denote by* γ(ℓ⃗) *the tuple* (γ(ℓ1),..., γ(ℓ*n*))*. If* s *is a store, then* γ(s) *denotes the store with domain* dom(s)*, such that* γ(s)(*x*) def= γ(s(*x*))*, for all x* ∈ dom(s)*. Consider a heap* h *such that for all* ℓ ≠ ℓ′ ∈ dom(h)*, we have* γ(ℓ) ≠ γ(ℓ′)*. Then* γ(h) *denotes the heap with domain* dom(γ(h)) = {γ(ℓ) | ℓ ∈ dom(h)}*, such that* γ(h)(γ(ℓ)) def= γ(h(ℓ))*, for all* ℓ ∈ dom(h)*.*

The following lemma identifies conditions ensuring that the application of a mapping to a structure (Definition 9) preserves the truth value of a formula.

Lemma 10. *Given a set of variables V, let* α *be a formula that is* λ*-restricted w.r.t. V, such that P*(α) ⊆ *P<sup>r</sup>, and let* (s,h) *be an R-model of* α*. For every mapping* γ : *L* → *L such that* γ(ℓ) = γ(ℓ′) ⇒ ℓ = ℓ′ *holds whenever either* {ℓ, ℓ′} ⊆ dom(h) *or* {ℓ, ℓ′} ∩ s(*V*) ≠ ∅*, we have* (γ(s), γ(h)) |=*R* α*.*

If γ is, moreover, injective, then the result of Lemma 10 holds for any formula:

Lemma 11. *Let* α *be a formula and let* (s,h) *be an R -model of* α*. For every injective mapping* γ : *L* → *L we have* (γ(s), γ(h)) |=*<sup>R</sup>* α*.*

<sup>3</sup> Point (1) of Definition 7 in conjunction with point (2) of Definition 8.

#### Fig. 2. Heap Expansion and Truncation

#### 4.1 Expansions and Truncations

We introduce a so-called *expansion* relation on structures, as well as a *truncation* operation on heaps. Intuitively, the expansion of a structure is a structure with the same store and whose heap is augmented with new allocated locations (each pointing to ⊥) and additional record fields, referring in particular to all the newly added allocated locations. These locations are introduced to accommodate all the existential variables of the predicate-less unfolding of the left-hand side of the entailment (to ensure that the obtained entailment is left-established). Conversely, the truncation of a heap is the heap obtained by removing these extra locations. We also introduce the notion of a γ-expansion, which is a structure whose image by γ is an expansion.

We recall that, throughout this section and the next, *w* = (*w*1,...,*w*ν) denotes the vector of free variables occurring in the problem, which is assumed to be fixed, and that {*w*1,...,*w*ν,⊥} ⊆ dom(s) for every store s considered here. Moreover, we assume w.l.o.g. that *w*1,...,*w*ν do not occur in the considered SID *R* and denote by *μ* the number of existential variables in each rule of *R*. We refer to Figure 2 for an illustration of the definition below:

Definition 12. *Let* γ : *L* → *L be a total mapping. A structure* (s,h′) *is a* γ-expansion *(or simply an* expansion *if* γ = id*) of a structure* (s,h)*, denoted by* (s,h′) ⊳<sub>γ</sub> (s,h)*, if* h : *L* ⇀ *L*<sup>κ</sup>*,* h′ : *L* ⇀ *L*<sup>κ+μ+ν</sup> *and there exist two disjoint heaps* main(h′) *and* aux(h′) *such that* h′ = main(h′) ⊎ aux(h′) *and the following hold:*


Let (s,h′) be a γ-expansion of (s,h) and let ℓ ∈ dom(main(h′)) be a location. Since ν > 0 and, for all *i* ∈ ⟦1,ν⟧, s(*wi*) occurs in h′(ℓ), and since we assume that s(*wi*) ≠ s(⊥) for every *i* ∈ ⟦1,ν⟧, necessarily main(h′)(ℓ) ≠ ⊥⃗. This entails that the decomposition

<sup>4</sup> Note that ℓ does not depend on γ, and if several such locations exist, then one is chosen arbitrarily.

h′ = main(h′) ⊎ aux(h′) is unique: main(h′) and aux(h′) are the restrictions of h′ to the locations ℓ ∈ dom(h′) such that h′(ℓ) ≠ ⊥⃗ and h′(ℓ) = ⊥⃗, respectively. In the following, we shall thus freely use the notations aux(h′) and main(h′) for arbitrary heaps h′.

Definition 13. *Given a heap* h′*, we denote by* trunc(h′) *the heap* h *defined as follows:* dom(h) def= dom(h′) \ {ℓ ∈ dom(h′) | h′(ℓ) = ⊥⃗} *and, for all* ℓ ∈ dom(h)*, if* h′(ℓ) = (ℓ1,...,ℓκ+ν+*μ*)*, then* h(ℓ) def= (ℓ1,...,ℓκ)*.*

Note that, if h = trunc(h′), then h : *L* ⇀ *L*<sup>κ</sup> and h′ : *L* ⇀ *L*<sup>κ+μ+ν</sup> are heaps of different out-degrees. In the following, we silently assume this fact, to avoid cluttering the notation by explicitly specifying the out-degree of a heap.

*Example 14.* Assume that *L* = N, ν = *μ* = 1. Let s be a store such that s(*w*1) = 0. We consider:

$$\begin{aligned} \mathfrak{h} &\stackrel{\text{def}}{=} \{ \langle 1, 2 \rangle, \langle 2, 2 \rangle \}, \\ \mathfrak{h}'_1 &\stackrel{\text{def}}{=} \{ \langle 1, (2, 0, 1) \rangle, \langle 2, (2, 0, 3) \rangle, \langle 3, (\bot, \bot, \bot) \rangle \}, \\ \mathfrak{h}'_2 &\stackrel{\text{def}}{=} \{ \langle 1, (3, 0, 1) \rangle, \langle 2, (4, 0, 3) \rangle, \langle 3, (\bot, \bot, \bot) \rangle \}. \end{aligned}$$

We have (s,h′<sub>1</sub>) ▷<sub>id</sub> (s,h) and (s,h′<sub>2</sub>) ▷<sub>γ</sub> (s,h), with γ <sup>def</sup>= {1 ↦ 1, 2 ↦ 2, 3 ↦ 2, 4 ↦ 2}. Also, trunc(h′<sub>1</sub>) = {1 ↦ 2, 2 ↦ 2} = h and trunc(h′<sub>2</sub>) = {1 ↦ 3, 2 ↦ 4}. Note that h has out-degree κ = 1, whereas h′<sub>1</sub> and h′<sub>2</sub> have out-degree 3.
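The bookkeeping of Example 14 can be replayed with a small executable sketch, where a heap is a finite map from locations to tuples of locations. The encoding (dictionaries, a `BOT` marker for ⊥, and the helper names `trunc` and `apply_gamma`) is ours, not the paper's; `apply_gamma` only remaps the stored values, which suffices here because γ is the identity on dom(trunc(h′<sub>2</sub>)) = {1, 2}.

```python
# Heaps as dicts from locations to tuples of locations (Example 14:
# L = N, kappa = 1, nu = mu = 1); BOT stands for the dummy value ⊥.
BOT = None

def trunc(h_prime, kappa=1):
    """Definition 13 (sketch): drop cells whose image is the dummy tuple
    (⊥,...,⊥) and keep only the first kappa components of the rest."""
    return {loc: img[:kappa] for loc, img in h_prime.items()
            if any(v is not BOT for v in img)}

def apply_gamma(gamma, h):
    """Apply a location mapping to the values stored in a heap."""
    return {loc: tuple(gamma.get(v, v) for v in img)
            for loc, img in h.items()}

h  = {1: (2,), 2: (2,)}                              # out-degree 1
h1 = {1: (2, 0, 1), 2: (2, 0, 3), 3: (BOT, BOT, BOT)}  # out-degree 3
h2 = {1: (3, 0, 1), 2: (4, 0, 3), 3: (BOT, BOT, BOT)}
gamma = {1: 1, 2: 2, 3: 2, 4: 2}

assert trunc(h1) == h                       # trunc(h'_1) = h
assert trunc(h2) == {1: (3,), 2: (4,)}      # trunc(h'_2) = {1 -> 3, 2 -> 4}
assert apply_gamma(gamma, trunc(h2)) == h   # Lemma 15: h = gamma(trunc(h'_2))
```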

Lemma 15. *If* (s,h′) ▷<sub>γ</sub> (s,h) *then* h = γ(trunc(h′))*, hence* (s,h′) ▷<sub>id</sub> (s,trunc(h′))*.*

The converse of Lemma 15 does not hold in general, but it holds under some additional conditions:

Lemma 16. *Consider a store* s*, let* h′ *be a heap and let* h <sup>def</sup>= trunc(h′)*. Let* D<sub>2</sub> <sup>def</sup>= {ℓ ∈ dom(h′) | h′(ℓ) = (⊥,...,⊥)} *and* D<sub>1</sub> <sup>def</sup>= dom(h′) \ D<sub>2</sub>*. Assume that:*

*1. for every location* ℓ ∈ D<sub>1</sub>*,* h(ℓ) *is of the form* (ℓ<sub>1</sub>,...,ℓ<sub>κ</sub>) *and* h′(ℓ) *is of the form* (ℓ<sub>1</sub>,...,ℓ<sub>κ</sub>, s(*w*), ℓ′<sub>1</sub>,...,ℓ′<sub>μ</sub>)*;*

*2. every location* ℓ ∈ D<sub>2</sub> *has a connection in* h′*.*

*Then* (s,h′) ▷<sub>id</sub> (s,h)*.*

#### 4.2 Transforming the Consequent

We first describe the transformation for the right-hand side of the entailment problem, as this transformation is simpler.

Definition 17. *We associate each n-ary predicate p* ∈ *P<sup>r</sup> with a new predicate* p̂ *of arity n*+ν*. We denote by* α̂ *the formula obtained from* α *by replacing every predicate atom p*(x<sub>1</sub>,..., x<sub>n</sub>) *by* p̂(x<sub>1</sub>,..., x<sub>n</sub>,*w*)*, where w* = (w<sub>1</sub>,...,w<sub>ν</sub>)*.*

Definition 18. *We denote by* R̂ *the set of rules of the form:*

$$\widehat{p}(x_1, \dots, x_n, \boldsymbol{w}) \Leftarrow x_1 \mapsto (y_1, \dots, y_\kappa, \boldsymbol{w}, z_1, \dots, z_\mu) \ast \widehat{\rho}\sigma \ast \xi_I \ast \mathop{\ast}_{x \in \mathrm{dom}(\sigma)} x \approx \sigma(x)$$

*where:*

$$- \; \xi_I \stackrel{\text{def}}{=} \mathop{\ast}_{i \in I} \bot(z_i), \text{ with } I \subseteq \{1, \dots, \mu\},$$

$$- \; \sigma \text{ is a substitution with } \mathrm{rng}(\sigma) \subseteq \{w_1, \dots, w_\nu\}.$$

*We denote by* R̂<sub>r</sub> *the set of rules in* R̂ *that are connected*<sup>5</sup>*.*

Note that the free variables *w* are added as parameters in the rules above, instead of some arbitrary tuple of fresh variables ω, of the same length as *w*. This is for the sake of conciseness, since these parameters ω will be systematically mapped to *w*.

*Example 19.* Assume that ψ = ∃*x* . *p*(*x*,w<sub>1</sub>), with ν = 1, μ = 1 and λ(*p*) = {2}. Assume also that *p* is associated with the rule: *p*(u<sub>1</sub>,u<sub>2</sub>) ⇐ u<sub>1</sub> ↦ (u<sub>1</sub>) ∗ *q*(u<sub>2</sub>). Observe that the rule is λ-connected, but not connected. Then dom(σ) ⊆ {u<sub>2</sub>}, rng(σ) ⊆ {w<sub>1</sub>} and *I* ⊆ {1}, so that R̂ contains the following rules:

$$\begin{array}{l}(1)\ \widehat{p}(u_1,u_2,w_1) \Leftarrow u_1 \mapsto (u_1,w_1,z_1) \ast q(u_2)\\(2)\ \widehat{p}(u_1,u_2,w_1) \Leftarrow u_1 \mapsto (u_1,w_1,z_1) \ast q(u_2) \ast \bot(z_1)\\(3)\ \widehat{p}(u_1,u_2,w_1) \Leftarrow u_1 \mapsto (u_1,w_1,z_1) \ast q(w_1) \ast u_2 \approx w_1\\(4)\ \widehat{p}(u_1,u_2,w_1) \Leftarrow u_1 \mapsto (u_1,w_1,z_1) \ast q(w_1) \ast \bot(z_1) \ast u_2 \approx w_1\end{array}$$

Rules (1) and (2) are not connected, hence do not occur in R̂<sub>r</sub>. Rules (3) and (4) are connected, hence occur in R̂<sub>r</sub>. Note that (4) is established, but (3) is not.
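The count of four rules in Example 19 matches the two independent choices made in Definition 18: a subset I ⊆ {1,…,μ} and a substitution σ with dom(σ) ⊆ {u<sub>2</sub>} and rng(σ) ⊆ {w<sub>1</sub>}. A quick enumeration (our own sketch, with μ = 1) confirms it:

```python
# Enumerate the choices generating rules (1)-(4) of Example 19: each
# rule of Definition 18 is determined by a subset I of {1,...,mu} and a
# substitution sigma; here mu = 1, dom(sigma) ⊆ {u2}, rng(sigma) = {w1}.
from itertools import chain, combinations

def powerset(xs):
    return list(chain.from_iterable(combinations(xs, k)
                                    for k in range(len(xs) + 1)))

choices_I = powerset([1])           # I ranges over {}, {1}
choices_sigma = powerset(['u2'])    # sigma ranges over {}, {u2 -> w1}
assert len(choices_I) * len(choices_sigma) == 4   # rules (1)-(4)
```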

We now relate the SIDs R and R̂<sub>r</sub> by the following result:

Lemma 20. *Let* α *be a formula that is* λ*-restricted w.r.t.* {w<sub>1</sub>,...,w<sub>ν</sub>} *and contains no points-to atoms, with P*(α) ⊆ *P<sup>r</sup>. Given a store* s *and two heaps* h *and* h′*, such that* (s,h′) ▷<sub>id</sub> (s,h)*, we have* (s,h′) |=<sub>R̂<sub>r</sub></sub> α̂ *if and only if* (s,h) |=<sub>R</sub> α*.*

## 4.3 Transforming the Antecedent

We now describe the transformation operating on the left-hand side of the entailment problem. For technical convenience, we make the following assumption:

Assumption 21. *We assume that, for every predicate p* ∈ *P<sup>l</sup>, every rule of the form p*(x<sub>1</sub>,..., x<sub>n</sub>) ⇐ π *in R and every atom q*(x′<sub>1</sub>,..., x′<sub>m</sub>) *occurring in* π*, x*′<sub>1</sub> ∉ {x<sub>1</sub>,..., x<sub>n</sub>}*.*

This is without loss of generality, because every variable x′<sub>1</sub> ∈ {x<sub>1</sub>,..., x<sub>n</sub>} can be replaced by a fresh variable *z*, while conjoining the equational atom *z* ≈ x′<sub>1</sub> to π. Note that the SID obtained in this way may no longer be connected, but this is not problematic, because the left-hand side of the entailment is not required to be connected anyway.

Definition 22. *We associate each pair* (*p*,*X*)*, where p* ∈ *P<sup>l</sup>, ar*(*p*) = *n and X* ⊆ ⟦1,n⟧*, with a fresh predicate symbol* p<sub>X</sub>*, such that ar*(p<sub>X</sub>) = *n*+ν*. A* decoration *of a formula* α *containing no points-to atoms, such that P*(α) ⊆ *P<sup>l</sup>, is a formula obtained by replacing each predicate atom* β <sup>def</sup>= *q*(y<sub>1</sub>,..., y<sub>m</sub>) *in* α *by an atom of the form* q<sub>X<sub>β</sub></sub>(y<sub>1</sub>,..., y<sub>m</sub>,*w*)*, with* X<sub>β</sub> ⊆ ⟦1,m⟧*. The set of decorations of a formula* α *is denoted by D*(α)*.*

<sup>5</sup> Note that all the rules in R̂ are progressing.

The role of the set *X* in a predicate atom p<sub>X</sub>(x<sub>1</sub>,..., x<sub>n</sub>,*w*) will be explained below. Note that the set of decorations of a formula α is always finite.

Definition 23. *We denote by D*(R′) *the set of rules of the form*

$$p_X(x_1, \dots, x_n, \boldsymbol{w}) \Leftarrow x_1 \mapsto (y_1, \dots, y_\kappa, \boldsymbol{w}, z_1, \dots, z_\mu) \ast \rho' \ast \mathop{\ast}_{i \in I} \bot(z_i),$$

*where:*


Lemma 24. *Let* α *be a formula containing no points-to atom, with P*(α) ⊆ *P<sup>l</sup>, and let* α′ *be a decoration of* α*. If* (s,h′) |=<sub>D(R′)</sub> α′ *and* (s,h′) ▷<sub>id</sub> (s,h)*, then* (s,h) |=<sub>R</sub> α*.*

At this point, the set *X* for a predicate symbol p<sub>X</sub> is of little interest: atoms are simply decorated with arbitrary sets. However, we shall restrict the considered rules in such a way that for every model (s,h) of an atom p<sub>X</sub>(x<sub>1</sub>,..., x<sub>n+ν</sub>), with *n* = *ar*(*p*), the set *X* denotes a set of indices i ∈ ⟦1,n⟧ such that s(x<sub>i</sub>) ∈ dom(h). In other words, *X* will denote a set of formal parameters of p<sub>X</sub> that are allocated in every model of p<sub>X</sub>.

Definition 25. *Given a formula* α*, we define the set Alloc*(α) *as follows: x* ∈ *Alloc*(α) *iff* α *contains either a points-to atom of the form x* ↦ (y<sub>1</sub>,..., y<sub>κ+ν+μ</sub>)*, or a predicate atom* q<sub>X</sub>(x′<sub>1</sub>,..., x′<sub>m+ν</sub>) *with x*′<sub>i</sub> = *x for some i* ∈ *X.*

Note that, in contrast with Definition 1, we do not consider that *x* ∈ *Alloc*(α), for those variables *x* related to a variable from *Alloc*(α) by equalities.
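Definition 25 is a purely syntactic scan of the formula. The following sketch (our own encoding of atoms as tagged tuples, not the paper's notation) computes Alloc for a formula given as the list of its atoms, deliberately ignoring equalities, as noted above:

```python
# Atoms as tagged tuples: ('pto', x, ys) encodes x -> (y1,...,yk);
# ('pred', q, X, args) encodes q_X(args), where X holds 1-based indices
# of formal parameters.
def alloc(atoms):
    """Definition 25 (sketch): x is allocated iff it is the source of a
    points-to atom, or fills a position i in X of a decorated atom."""
    out = set()
    for atom in atoms:
        if atom[0] == 'pto':
            out.add(atom[1])
        else:
            _tag, _q, X, args = atom
            out.update(args[i - 1] for i in X)
    return out

# x1 is the source of a points-to atom; x2 fills position 1 in X = {1}.
phi = [('pto', 'x1', ('y1',)), ('pred', 'q', {1}, ('x2', 'w1'))]
assert alloc(phi) == {'x1', 'x2'}
```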

Definition 26. *A rule* p<sub>X</sub>(x<sub>1</sub>,..., x<sub>n+ν</sub>) ⇐ π *in D*(R′)*, with n* = *ar*(*p*) *and* π = x<sub>1</sub> ↦ (y<sub>1</sub>,..., y<sub>κ</sub>,*w*,z<sub>1</sub>,...,z<sub>μ</sub>) ∗ ρ*, is* well-defined *if the following conditions hold:*

*1.* {x<sub>1</sub>} ⊆ *Alloc*(p<sub>X</sub>(x<sub>1</sub>,..., x<sub>n+ν</sub>)) ⊆ *Alloc*(π)*;*

*2.* fv(π) ⊆ *Alloc*(π) ∪ {x<sub>1</sub>,..., x<sub>n+ν</sub>}*.*

*We denote by* R<sub>l</sub> *the set of well-defined rules in D*(R′)*.*

We first state an important property of R<sub>l</sub>.

Lemma 27. *Every rule in R<sup>l</sup> is progressing, connected and established.*

We now relate the systems R and R<sub>l</sub> by the following result:

Definition 28. *A store* s *is* quasi-injective *if, for all x*,*y* ∈ dom(s)*, the implication* s(*x*) = s(*y*) ⇒ *x* = *y holds whenever* {*x*, *y*} ⊄ {w<sub>1</sub>,...,w<sub>ν</sub>}*.*

Lemma 29. *Let L′ be an infinite subset of L. Consider a formula* α *containing no points-to atom, with P*(α) ⊆ *P<sup>l</sup>, and let* (s,h) *be an* R′*-model of* α*, where* s *is quasi-injective and* (rng(s) ∪ loc(h)) ∩ *L′* = ∅*. There exists a decoration* α′ *of* α*, a heap* h′ *and a mapping* γ : *L* → *L such that:*

– (s,h′) ▷<sub>γ</sub> (s,h)*,*
– *if* ℓ ∉ *L′ then* γ(ℓ) = ℓ*,*
– loc(h′) \ rng(s) ⊆ *L′,*
– dom(aux(h′)) ⊆ *L′, and*
– (s,h′) |=<sub>R<sub>l</sub></sub> α′*.*

*Furthermore, if* s(*u*) ∈ dom(h′) \ {s(w<sub>i</sub>) | 1 ≤ *i* ≤ ν} *then u* ∈ *Alloc*(α′)*.*

#### 4.4 Transforming Entailments

We define R̄ <sup>def</sup>= R<sub>l</sub> ∪ R̂<sub>r</sub>. We show that the instance φ ⊢<sub>R</sub> ψ of the safe entailment problem can be solved by considering entailment problems on R̄ involving the elements of D(φ) (see Definition 22). Note that the rules from R<sub>l</sub> are progressing, connected and established, by Lemma 27, whereas the rules from R̂<sub>r</sub> are progressing and connected, by Definition 18. Hence, each entailment problem φ′ ⊢<sub>R̄</sub> ψ̂, where φ′ ∈ D(φ), is progressing, connected and left-established.

Lemma 30. φ |=<sub>R</sub> ψ *if and only if* φ′ |=<sub>R̄</sub> ψ̂ *for every* φ′ ∈ D(φ)*.*

*Proof.* "⇒" Assume that φ |=<sub>R</sub> ψ, let φ′ ∈ D(φ) be a formula, let (s,h′) be an R̄-model of φ′ and let h <sup>def</sup>= trunc(h′). By construction, (s,h′) is an R<sub>l</sub>-model of φ′. By definition of D(φ), φ′ is a decoration of φ. Let D<sub>2</sub> <sup>def</sup>= {ℓ ∈ dom(h′) | h′(ℓ) = (⊥,...,⊥)}, D<sub>1</sub> <sup>def</sup>= dom(h′) \ D<sub>2</sub>, and consider a location ℓ ∈ dom(h′). By definition, ℓ must be allocated by some rule in R<sub>l</sub>. If ℓ is allocated by a rule of the form given in Definition 23, then necessarily h′(ℓ) is of the form (ℓ<sub>1</sub>,...,ℓ<sub>κ</sub>, s(*w*), ℓ′<sub>1</sub>,...,ℓ′<sub>μ</sub>) and ℓ ∈ D<sub>1</sub>. Otherwise, ℓ is allocated by the predicate ⊥ and we must have ℓ ∈ D<sub>2</sub>, by definition of the only rule for ⊥. Since this predicate must occur within a rule of the form given in Definition 23, ℓ necessarily occurs in the μ last components of the image of a location in D<sub>1</sub>, hence admits a connection in h′. Consequently, by Lemma 16, (s,h′) ▷<sub>id</sub> (s,h), and by Lemma 24, (s,h) |=<sub>R</sub> φ. Thus (s,h) |=<sub>R</sub> ψ, and by Lemma 20, (s,h′) |=<sub>R̂<sub>r</sub></sub> ψ̂, thus (s,h′) |=<sub>R̄</sub> ψ̂.

"⇐" Assume that <sup>φ</sup>∈*D*(φ) <sup>φ</sup> <sup>|</sup>=*R* ψ and let (s,h) be a *R* -model of φ. Since the truth values of φ and ψ depend only on the variables in fv(φ)∪fv(ψ), we may assume, w.l.o.g., that s is quasi-injective. Consider an infinite set *L* ⊆ *L* such that (rng(s) ∪ loc(h)) ∩ *L* = 0/. By Lemma 29, there exist a heap h , a mapping γ : *L* → *L* and a decoration φ of φ such that γ(-) = for all - /∈ *L*, (s,h ) <sup>γ</sup> (s,h) and (s,h ) |= φ . Since rng(s) ∩ *L* = 0/, we also have γ(s) = s. Then (s,h ) <sup>|</sup><sup>=</sup> <sup>ψ</sup>-. Let h<sup>1</sup> def = trunc(h ). Since (s,h ) <sup>γ</sup> (s,h), by Lemma 15 we have (s,h ) *id* (s,h1), and by Lemma 20, (s,h1) |= ψ. By Lemma 15 we have h = γ(h1). Since ψ is λ-restricted w.r.t. {*w*1,...,*wn*}, we deduce by Lemma 10 that (s,h) |= ψ.

This leads to the main result of this paper:

#### Theorem 31. *The safe entailment problem is 2EXPTIME-complete.*

*Proof.* The 2EXPTIME-hard lower bound follows from [8, Theorem 32], as the class of progressing, λ-connected and λ-restricted entailment problems is a subset of the safe entailment class. For the 2EXPTIME membership, Lemma 30 describes a many-one reduction to the progressing, connected and established class, shown to be in 2EXPTIME by Theorem 4. Considering an instance P = φ ⊢<sub>R</sub> ψ of the safe class, Lemma 30 reduces this to checking the validity of |D(φ)| instances of the form φ′ ⊢<sub>R̄</sub> ψ̂, that are all progressing, connected and established, by Lemma 27. Since a formula φ′ ∈ D(φ) is obtained by replacing each predicate atom *p*(x<sub>1</sub>,..., x<sub>n</sub>) of φ by p<sub>X</sub>(x<sub>1</sub>,..., x<sub>n</sub>,*w*), and there are at most 2<sup>n</sup> such predicates p<sub>X</sub>, it follows that |D(φ)| = 2<sup>O(w(P))</sup>. To obtain 2EXPTIME membership of the problem, it is sufficient to show that each of the progressing, connected and established instances φ′ ⊢<sub>R̄</sub> ψ̂ can be built in time |P| · 2<sup>O(w(P)·log w(P))</sup>. First, for each φ′ ∈ D(φ), by Definition 22, we have |φ′| ≤ |φ|·(1+ν) ≤ |φ|·(1+w(P)) = |φ| · 2<sup>O(log w(P))</sup>. By Definition 17, we have |ψ̂| ≤ |ψ|·(1+ν) = |ψ| · 2<sup>O(log w(P))</sup>. By Definition 23, D(R′) can be obtained by enumeration in time that depends linearly on

$$|D(\mathcal{R}')| \le |\mathcal{R}| \cdot 2^{\mu} \cdot (n + \nu + \mu)^{\nu} \le |\mathcal{R}| \cdot 2^{\mathrm{w}(\mathfrak{P})} \cdot 2^{\mathrm{w}(\mathfrak{P}) \cdot \log \mathrm{w}(\mathfrak{P})} = |\mathfrak{P}| \cdot 2^{O(\mathrm{w}(\mathfrak{P}) \cdot \log \mathrm{w}(\mathfrak{P}))}$$

This is because the number of sets *I* is bounded by 2<sup>μ</sup> and the number of substitutions σ by (n+ν+μ)<sup>ν</sup>, in Definition 23. By Definition 26, checking whether a rule is well-defined can be done in polynomial time in the size of the rule, hence in 2<sup>O(w(P))</sup>, so the construction of R<sub>l</sub> takes time |P| · 2<sup>O(w(P)·log w(P))</sup>. Similarly, by Definition 18, the set R̂ is constructed in time

$$|\widehat{\mathcal{R}}| \le |\mathcal{R}| \cdot 2^{\mu} \cdot \mathrm{w}(\mathfrak{P})^{\nu} \le |\mathcal{R}| \cdot 2^{\mathrm{w}(\mathfrak{P})} \cdot 2^{\mathrm{w}(\mathfrak{P}) \cdot \log \mathrm{w}(\mathfrak{P})} = |\mathfrak{P}| \cdot 2^{O(\mathrm{w}(\mathfrak{P}) \cdot \log \mathrm{w}(\mathfrak{P}))}$$

Moreover, checking that a rule in R̂ is connected can be done in time polynomial in the size of the rule, hence the construction of R̂<sub>r</sub> takes time 2<sup>O(w(P)·log w(P))</sup>. Then the entire reduction takes time 2<sup>O(w(P)·log w(P))</sup>, which proves the 2EXPTIME upper bound for the safe class of entailments.

#### 5 Conclusion and Future Work

Together with the results of [10,14,6,8], Theorem 31 draws a clear and complete picture concerning the decidability and complexity of the entailment problem in Separation Logic with inductive definitions. The room for improvement in this direction is probably very limited, since Theorem 31 pushes the frontier quite far. Moreover, virtually any further relaxation of the conditions leads to undecidability.

A possible line of future research which could be relevant for applications would be to consider inductive rules constructing simultaneously several data structures, which could be useful for instance to handle predicates comparing two structures, but it is clear that very strong conditions would be required to ensure decidability. We are also interested in defining effective, goal-directed, proof procedures (i.e., sequent or tableaux calculi) for testing the validity of entailment problems. Thanks to the reduction devised in the present paper, it is sufficient to focus on systems that are progressing, connected and left-established. We are also trying to extend the results to entailments with formulæ involving data with infinite domains, either by considering a theory of locations (e.g., arithmetic on addresses), or, more realistically, by considering additional sorts for data.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Subformula Linking for Intuitionistic Logic with Application to Type Theory**

Kaustuv Chaudhuri

Inria & LIX/Ecole polytechnique, Palaiseau, France kaustuv.chaudhuri@inria.fr, https://chaudhuri.info

**Abstract.** Subformula linking is an interactive theorem proving technique that was initially proposed for (classical) linear logic. It is based on truth and context preserving rewrites of a conjecture that are triggered by a user indicating *links* between subformulas, which can be done by direct manipulation, without the need of tactics or proof languages. The system guarantees that a true conjecture can always be rewritten to a known, usually trivial, theorem. In this work, we extend subformula linking to intuitionistic first-order logic with simply typed lambda-terms as the term language of this logic. We then use a well known embedding of intuitionistic type theory into this logic to demonstrate one way to extend linking to type theory.

## **1 Introduction**

Suppose you want to prove a conjecture such as:

$$(\forall x.\, \exists y.\, a(f(x), y)) \land (\forall z.\, a(f(f(c)), z) \supset b(z)) \supset \exists u.\, b(f(u))$$

or to find replacements for the ?s that would allow a dependent type such as the following to be inhabited:

Πu : (Πx : a. Πy : (b x). c x y). Πv : (Πx : a. b x). Πw : a. (c ? ?).

In a mainstream interactive theorem proving system you would attempt it by giving instructions to a carefully constructed proof verification engine using a *formal proof language*, often with a *read-eval-print* loop for immediate feedback. Your instructions would guide the verifier through the twists and turns of a formal derivation until it is satisfied that all formal obligations have been established. Your language of instructions could be tactics-based (such as in Coq), or it could be a programming language itself (such as in HOL-Light or Agda); it could also have a formal *structure* or be *declarative* (such as Isabelle/Isar).<sup>1</sup> Despite these superficial differences, all such systems can broadly be called *linguistic* because the internal state of the verifier can only be modified by means of the formal

<sup>1</sup> These are just illustrative examples of mainstream proof systems and should not be read as assigning them a position of privilege or authority.

<sup>©</sup> The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 200–216, 2021. https://doi.org/10.1007/978-3-030-79876-5_12

proof language (and the whims—or semantics, if you prefer—of the interpreter of the language).

An alternative to such a linguistic system would be a system of *direct manipulation*, wherein there is a tangible representation of the state of the verifier that one can modify directly using such tools as one's fingers, pointing devices, or eye movements. The verifier's job is then to make sure that the direct manipulation attempts are allowed when they are logically permissible and prevented when they are not. A prominent example of such a direct manipulation system is the *proof by pointing* technique [3], where mouse clicks on the representation of a proof state (in a version of Coq) are given a meaning: a click on a connective deep in a formula is interpreted as a sequence of Coq tactics that bring the connective to the top, at which point it could be made to interact with the other hypotheses or the conclusion in the usual manner.

A generalization of this idea, called *proof by linking*, was proposed in [4]. It allows the user not only to point but also to *link* different subformulas, say with a multi-touch input device or with a drag-and-drop metaphor. There are two immediate benefits of linking over pointing: (1) the surrounding context of a formula is not destroyed because the linked subformulas are not brought to the top, and (2) the interaction mode is easier to describe to complete novices. For instance, a novice could be instructed to "match the atoms" for the first example above, in which case they might start by attempting the following link:

$$(\forall x.\, \exists y.\, \underline{a(f(x), y)}) \land (\forall z.\, \underline{a(f(f(c)), z)} \supset b(z)) \supset \exists u.\, b(f(u)).$$

The linking procedure would interpret this link as a desire to "bring" the source atom "to" the destination atom. Without touching any other part of the conjecture except the smallest subformula containing both the source and the destination of the link, the conjecture would be *rewritten* to a different one:

$$\exists x.\, \forall y.\, \forall z.\, ((a(f(x), y) \supset a(f(f(c)), z)) \supset b(z)) \supset \exists u.\, b(f(u)).$$

The surrounding context of the link is preserved as nothing is brought to the top; instead, the source moves through the formula tree to meet the destination. The rewrites that underlie the transformation are *provability preserving*: if the rewritten conjecture is provable, then so is the original conjecture. Eventually, the conjecture (if true) would be reduced to a trivial theorem such as ⊤. Note that the novice user does not need to know *any* proof language to draw these links, not even a conceptual proof system such as the sequent calculus.

The original *proof by linking* technique was proposed for classical linear logic and freely exploited the *calculus of structures* [17]. In this paper we show how to adapt the technique to intuitionistic logics and intuitionistic type theories, where the calculus of structures is not so well behaved [18,8] (or, in the case of dependent type theory, entirely missing), and where preserving the context of the rewrites is a more delicate task. We do this by first defining the technique for intuitionistic first-order logic over λ-terms, and then we use an existing complete

(shallow) embedding of dependent type theory in this logic [6,15]. A secondary contribution is to give some insight into what a deep inference formalism might look like for dependent type theory.

# **2 Subformula Linking for Intuitionistic First-Order Logic**

This section will serve both as an introduction to the subformula linking procedure and as evidence that the technique can be applied to intuitionistic logics. Let us do this in two phases: first for the propositional fragment, and then extended with first-order quantification.

#### **2.1 The Propositional Fragment**

We will use the following grammar of *formulas* (written A, B, . . .), where *atomic formulas* are written in lowercase (a, b, . . .).

$$A, B, \dots \;::=\; a \mid A \land B \mid \top \mid A \lor B \mid \bot \mid A \supset B$$

Following usual conventions, the connectives ∧ and ∨ are left-associative, while ⊃ is right-associative; the binding priority from strongest to weakest is ∧, ∨, ⊃.
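These conventions can be made concrete with a tiny printer over an ad-hoc formula representation (nested tuples, with `&`, `|`, `>` standing in for ∧, ∨, ⊃; the encoding is ours, not the paper's): it inserts parentheses exactly where the stated precedence and associativity require them.

```python
# Formulas as nested tuples ('and'|'or'|'imp', left, right); atoms are
# strings.  Precedence: 'and' > 'or' > 'imp'; 'imp' is right-associative,
# 'and' and 'or' are left-associative, as stated above.
PREC = {'and': 3, 'or': 2, 'imp': 1}

def show(f, ctx=0):
    if isinstance(f, str):
        return f
    op, a, b = f
    p = PREC[op]
    if op == 'imp':  # right-associative: parenthesize a nested left arm
        s = show(a, p + 1) + ' > ' + show(b, p)
    else:            # left-associative: parenthesize a nested right arm
        s = show(a, p) + (' & ' if op == 'and' else ' | ') + show(b, p + 1)
    return '(' + s + ')' if p < ctx else s

# The S-combinator formula (a > b > c) > (a > b) > a > c from section 2.
s_comb = ('imp', ('imp', 'a', ('imp', 'b', 'c')),
                 ('imp', ('imp', 'a', 'b'), ('imp', 'a', 'c')))
assert show(s_comb) == '(a > b > c) > (a > b) > a > c'
```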

The true formulas of this calculus can be defined in terms of derivability in a variety of formal systems such as with the sequent calculus LJ or G3ip [11]. In this paper the precise sequent calculus is not of primary concern; however, we will use the notation Γ ⊢ C where Γ is a multiset of formulas to denote that the formula C is derivable from the assumptions Γ using any such calculus.

A *positively signed formula context* (written C{}) is a formula with a single occurrence of a hole {} in the place where a positively signed subformula may occur; it is defined mutually recursively with a *negatively signed formula context* (written A{}) by the following grammar, where ∗ ∈ {∧, ∨}.

$$\begin{array}{l} \mathcal{C}\{\} \;::=\; \{\} \mid A \ast \mathcal{C}\{\} \mid \mathcal{C}\{\} \ast B \mid A \supset \mathcal{C}\{\} \mid \mathcal{A}\{\} \supset B \\ \mathcal{A}\{\} \;::=\; A \ast \mathcal{A}\{\} \mid \mathcal{A}\{\} \ast B \mid A \supset \mathcal{A}\{\} \mid \mathcal{C}\{\} \supset B \end{array}$$

The *replacement* of the hole in <sup>C</sup>{} (resp. <sup>A</sup>{}) with a formula <sup>A</sup> yields a new formula, which we write as <sup>C</sup>{A} (resp. <sup>A</sup>{A}). For instance, if <sup>C</sup>{} is <sup>a</sup> <sup>∧</sup> ((<sup>b</sup> <sup>⊃</sup> {}) <sup>∨</sup> <sup>d</sup>), then <sup>C</sup>{<sup>c</sup> <sup>⊃</sup> <sup>⊥</sup>} is <sup>a</sup> <sup>∧</sup> ((<sup>b</sup> <sup>⊃</sup> (<sup>c</sup> <sup>⊃</sup> <sup>⊥</sup>)) <sup>∨</sup> <sup>d</sup>).
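One lightweight way to realize contexts is as functions from the hole's content to formulas, so that replacement is just application. Replaying the example above in a tuple encoding of formulas (our own, with `'bot'` standing for ⊥):

```python
# A context C{} as a Python function: the replacement C{A} is application.
# C{} = a /\ ((b > {}) \/ d), from the example above.
def C(hole):
    return ('and', 'a', ('or', ('imp', 'b', hole), 'd'))

# C{c > bot} = a /\ ((b > (c > bot)) \/ d)
assert C(('imp', 'c', 'bot')) == \
    ('and', 'a', ('or', ('imp', 'b', ('imp', 'c', 'bot')), 'd'))
```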

**Theorem 1.** *Suppose that* A ⊢ B*. Then:*

*1.* C{A} ⊢ C{B} *for every positively signed context* C{}*;*
*2.* A{B} ⊢ A{A} *for every negatively signed context* A{}*.*
*Proof.* Induction on the structure of the contexts C{} or A{}. ⊓⊔

In order to define the subformula linking procedure for this calculus, we work with *interaction formulas*; an interaction formula is a formula where a single subformula occurrence is built with one of two *interaction connectives*: A ▹ B, which may occur only positively signed and is read as A ⊃ B, or A ◦ B, which may occur only negatively signed and is read as A ∧ B.

Terminal rules

$$\frac{\mathcal{C}\{\top\}}{\mathcal{C}\{a \rhd a\}}\ \mathsf{in} \qquad \frac{\mathcal{C}\{A \supset B\}}{\mathcal{C}\{A \rhd B\}}\ \mathsf{rel}$$

(the conclusion of rel is understood as not overlapping that of in)

Positively signed rules

$$\frac{\mathcal{C}\{(A \rhd B) \land F\}}{\mathcal{C}\{A \rhd (B \land F)\}}\ {\rhd\land}_1 \qquad \frac{\mathcal{C}\{F \land (A \rhd B)\}}{\mathcal{C}\{A \rhd (F \land B)\}}\ {\rhd\land}_2$$

$$\frac{\mathcal{C}\{(A \rhd B) \land (F \supset B)\}}{\mathcal{C}\{(A \lor F) \rhd B\}}\ {\lor\rhd}_1 \qquad \frac{\mathcal{C}\{(F \supset B) \land (A \rhd B)\}}{\mathcal{C}\{(F \lor A) \rhd B\}}\ {\lor\rhd}_2$$

$$\frac{\mathcal{C}\{(A \circ B) \supset F\}}{\mathcal{C}\{A \rhd (B \supset F)\}}\ {\rhd\supset}_1 \qquad \frac{\mathcal{C}\{F \supset (A \rhd B)\}}{\mathcal{C}\{A \rhd (F \supset B)\}}\ {\rhd\supset}_2$$

$$\frac{\mathcal{C}\{A \rhd B\}}{\mathcal{C}\{A \rhd (B \lor F)\}}\ {\rhd\lor}_1 \qquad \frac{\mathcal{C}\{A \rhd B\}}{\mathcal{C}\{A \rhd (F \lor B)\}}\ {\rhd\lor}_2$$

$$\frac{\mathcal{C}\{A \rhd B\}}{\mathcal{C}\{(A \land F) \rhd B\}}\ {\land\rhd}_1 \qquad \frac{\mathcal{C}\{A \rhd B\}}{\mathcal{C}\{(F \land A) \rhd B\}}\ {\land\rhd}_2 \qquad \frac{\mathcal{C}\{F \land (A \rhd B)\}}{\mathcal{C}\{(F \supset A) \rhd B\}}\ {\supset\rhd}$$

Negatively signed rules

$$\frac{\mathcal{A}\{(A \circ B) \lor F\}}{\mathcal{A}\{A \circ (B \lor F)\}}\ {\circ\lor}_1 \qquad \frac{\mathcal{A}\{F \lor (A \circ B)\}}{\mathcal{A}\{A \circ (F \lor B)\}}\ {\circ\lor}_2$$

$$\frac{\mathcal{A}\{A \circ B\}}{\mathcal{A}\{A \circ (B \land F)\}}\ {\circ\land}_1 \qquad \frac{\mathcal{A}\{A \circ B\}}{\mathcal{A}\{A \circ (F \land B)\}}\ {\circ\land}_2$$

$$\frac{\mathcal{A}\{(A \rhd B) \supset F\}}{\mathcal{A}\{A \circ (B \supset F)\}}\ {\circ\supset}_1 \qquad \frac{\mathcal{A}\{F \supset (A \circ B)\}}{\mathcal{A}\{A \circ (F \supset B)\}}\ {\circ\supset}_2$$

(plus all the symmetric variants)

**Fig. 1.** Inference rules for interaction formulas


We will define an inference system for interaction formulas that consist of inference rules with a single conclusion and a single premise, both of which are either formulas or interaction formulas. The inference rule represents an admissible rule of intuitionistic logic: if the premise is a theorem, then so is the conclusion. The full collection of rules is shown in fig. 1. There are three kinds of rules, explained below in an upwards (conclusion to premises) reading.

**–** *Terminal rules* are used to terminate a ▹-interaction in a positively signed context. In the case where the ▹-interaction links two occurrences of the


same atom, the result is ⊤; otherwise the ▹ turns back into ⊃. These are the only rules that can transition out of interaction formulas.

**Fig. 2.** Link creation, contraction, and simplification. The conclusion in each case must not be an interaction formula.
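The two terminal rules can be sketched as a bottom-up rewrite over a tuple encoding of formulas (an encoding of our own, with `'rhd'` marking a ▹ interaction): in fires when both sides of ▹ are the same atom, and rel otherwise releases ▹ back into ⊃.

```python
# Formulas as nested tuples; 'rhd' marks the interaction A |> B.
def terminate(f):
    """Apply the terminal rules bottom-up: a |> a becomes 'top' (rule
    in), any other A |> B turns back into an implication (rule rel)."""
    if isinstance(f, str):
        return f
    op, a, b = f
    a, b = terminate(a), terminate(b)
    if op == 'rhd':
        return 'top' if isinstance(a, str) and a == b else ('imp', a, b)
    return (op, a, b)

# (a |> a) /\ (a |> b)  rewrites to  top /\ (a > b)
assert terminate(('and', ('rhd', 'a', 'a'), ('rhd', 'a', 'b'))) == \
    ('and', 'top', ('imp', 'a', 'b'))
```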


For instance, the symmetric variant of the rule ◦∨<sub>1</sub> is:

$$\frac{\mathcal{A}\{(A \circ B) \lor F\}}{\mathcal{A}\{(A \lor F) \circ B\}}\ {\circ\lor}_{1'}$$

We will use primes to systematically name the symmetric variants of rules.

**Proposition 2 (Soundness).** *Interpreting* ▹ *as* ⊃ *and* ◦ *as* ∧*, each rule of fig. 1 with premise* P *and conclusion* Q *has the property that* P ⊢ Q*.*

*Proof.* Straightforward consequence of theorem 1. ⊓⊔

Two further administrative steps remain to complete the technique. First, since the rules of fig. 1 always contain an interaction formula in the conclusion, we need to add some rules that can conclude ordinary (non-interaction) formulas. Since we read each inference rule from conclusion to premise, we will call these the *interaction creation* rules, which are shown in the first part of fig. 2. To incorporate non-linearity, we add a separate contraction rule; this keeps the interaction creation rules simple, but it needs to be explicitly invoked. These interaction creation rules are obviously sound under the interpretation of proposition 2.

**Fig. 3.** Lnip derivation fragment for the S-combinator

The final step is to detect when a proof is complete. Since every inference rule presented so far has a single premise, we will say that a proof is complete when the final (again reading bottom to top) premise is, effectively, ⊤. What do we mean by "effectively"? One candidate definition could be that a purely algorithmic procedure can detect when a proof is finished in linear time. For instance, we can say that a proof is complete if its premise can be established using only the *simplification rules* shown in the second part of fig. 2. These rules may be applied in any arbitrary order and at any time. An implementation of the technique may choose to apply these simplification rules on the fly.
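A simplification pass of this kind is easy to realize over a tuple encoding of formulas. The concrete rule set is the one in fig. 2; the following sketch of ours approximates it with standard intuitionistic unit laws for ⊤ and ⊥ (an assumption, not the paper's exact rules), applied bottom-up so that checking for completion stays linear-time:

```python
# Bottom-up simplification with unit laws for 'top' and 'bot' (our
# approximation of the simplification rules of fig. 2); all of the laws
# used here are intuitionistically valid equivalences.
def simplify(f):
    if isinstance(f, str):
        return f
    op, a, b = f
    a, b = simplify(a), simplify(b)
    if op == 'and':
        if a == 'top': return b      # top /\ A  ~>  A
        if b == 'top': return a      # A /\ top  ~>  A
    if op == 'or':
        if 'top' in (a, b): return 'top'   # A \/ top  ~>  top
        if a == 'bot': return b      # bot \/ A  ~>  A
        if b == 'bot': return a      # A \/ bot  ~>  A
    if op == 'imp':
        if a == 'top': return b      # top > A  ~>  A
        if b == 'top': return 'top'  # A > top  ~>  top
        if a == 'bot': return 'top'  # bot > A  ~>  top
    return (op, a, b)

# A proof attempt is complete when its premise simplifies to 'top'.
assert simplify(('imp', 'a', ('and', 'top', 'top'))) == 'top'
```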

**Definition 3.** *The collection of rules in figures 1 and 2 will be known as the proof system* Lnip*. If* A *and* B *are formulas or interaction formulas, we write* $A \xrightarrow{\textsf{Lnip}} B$ *to mean that either* A = B *or there is an* Lnip *derivation where the topmost rule has premise* A *and the bottom-most rule has conclusion* B*.* ⊓⊔

#### **Theorem 4 (Completeness of Lnip).** *If* ⊢ F*, then* $\top \xrightarrow{\textsf{Lnip}} F$*.*

*Proof (Sketch).* There are many ways to prove this, both syntactic and semantic. An instructive syntactic proof goes as follows. For a small variant of the G3ip sequent calculus [11], we show that every inference rule is admissible in Lnip under a suitable formula interpretation of sequents. Thus, any sequent proof is recoverable in terms of Lnip inferences. We then just appeal to completeness of the sequent calculus. ⊓⊔

*Example 5.* A Lnip derivation of the S-combinator formula, (a⊃b⊃c)⊃(a⊃b)⊃a⊃c, is shown in fig. 3. The interaction connectives ▹ and ◦ take the precedence and associativity of ⊃ and ∧ respectively. The locus where a Lnip rule is applied is depicted with a highlight. Of course, the S-combinator formula cannot be proved without appealing to contraction at least once, which is seen by the appeal to cont in the derivation.

An extremely interesting aspect of this example Lnip derivation is that it begins by considering the first two assumptions, (a ⊃ b ⊃ c) and (a ⊃ b), of the S-combinator formula. The user might have indicated this consideration by drawing a *link* between the two occurrences of b, highlighted in orange and blue in fig. 3. The effect of this consideration is to perform a "composition" of the two assumptions into the stronger assumption (a ⊃ a ⊃ ⊤ ⊃ c), which could of course have been simplified to (a ⊃ a ⊃ c) immediately. In shallow proof systems such as the sequent calculus or natural deduction this kind of compositional step cannot be taken as such, and would require cuts or lemmas.

As explained in the introduction, this kind of composition might have been discovered in the process of exploration by the simple strategy of drawing a *link* between the two occurrences of b. Such a link is legal because in the common context that contains both occurrences of b, their ancestral connective is ⊃, which can be turned into a ▹ interaction using the ▹ rule. Once these two occurrences are linked, we can interpret the interaction rules (fig. 1) as trying to bring the two ends of the link closer. Indeed, in each of the rules of fig. 1, we can say that one of the ends of the link is in the formula A and the other is in the formula B. We are therefore ready to formulate the linking procedure.

**Definition 6 (Subformula Linking Procedure).** *Repeat the following sequence of steps until the conjecture formula (i.e., end-formula)* F *is transformed to* ⊤ *(success), no fruitful progress can be made (failure), or the proof attempt is aborted by the user.*


The most important step in the inner loop of the procedure is step 4. The rules for interaction are ambiguous because the conclusions of different rules can overlap. Let us start by examining the positively signed rules; as an example, consider the interaction C{(F ⊃ A) ▹ (G ⊃ B)}, with the understanding that the endpoints of the indicated link in step 2 are present in A and B. There are two possible ways to resolve this link:

$$\frac{\dfrac{\mathcal{C}\{F \land (G \supset (A \rhd B))\}}{\mathcal{C}\{F \land (A \rhd (G \supset B))\}}\ {\rhd\supset_2}}{\mathcal{C}\{(F \supset A) \rhd (G \supset B)\}}\ {\supset\rhd} \qquad \frac{\dfrac{\mathcal{C}\{G \supset (F \land (A \rhd B))\}}{\mathcal{C}\{G \supset ((F \supset A) \rhd B)\}}\ {\supset\rhd}}{\mathcal{C}\{(F \supset A) \rhd (G \supset B)\}}\ {\rhd\supset_2}$$

Does the choice matter? Yes, because the formulas F ∧ (G⊃ H) and G⊃(F ∧ H) are not intuitionistically equivalent; indeed, the former strictly entails the latter. Hence, one of the two alternatives produces a strictly stronger—and potentially unprovable!—premise. Which one should the procedure pick?

This ambiguity also existed in the original formulation of the formula linking procedure for classical linear logic [4], and we can use the same answer used in that work. The key insight is that many of the ambiguous cases can be resolved by a simple analysis of *polarities*. A detailed discussion of polarity (and the oft-associated *focusing* discipline [1]) is not relevant to this work, however.<sup>2</sup> We will instead just use the observation that some of the interaction rules of fig. 1 are *asynchronous*, meaning that the premise of the rule is equiderivable with the conclusion—assuming we replace ▹ and ◦ with ⊃ and ∧ respectively—while other rules are *synchronous*, which means that the premise strictly entails the conclusion. For the specific example above, the ▹⊃₂ rule is asynchronous, because the order of assumptions in an implication is immaterial (at least in intuitionistic logic), while the ⊃▹ rule is synchronous since its conclusion cannot justify its premise. We can draw up this table for all the positively signed rules.

$$\begin{array}{|l|l|}
\hline
\text{asynchronous rules:} & \rhd\wedge_1,\ \rhd\wedge_2,\ \wedge\rhd_1,\ \wedge\rhd_2,\ \rhd\supset_1,\ \rhd\supset_2 \\
\text{synchronous rules:} & \rhd\vee_1,\ \rhd\vee_2,\ \vee\rhd_1,\ \vee\rhd_2,\ \supset\rhd \\
\hline
\end{array}$$

Whenever there is a choice between a synchronous and an asynchronous rule to apply first (reading from bottom to top), we should pick the asynchronous rule, since that does not destroy derivability. If we have a choice of two asynchronous rules, then the choice is immaterial, as derivability is preserved regardless; the procedure can pick arbitrarily. Different choices would just lead to associative-commutative variants of the same ultimate premise. Finally, for a choice between two synchronous rules, we can consider all such pairs from the table above to see that the choice is immaterial: all choices have the same result.

The story is not quite as simple for the negatively signed rules of fig. 1, where every single rule would be synchronous by our definition. Unlike in the positively signed case, here we have a critical pair.

$$\frac{\dfrac{\mathcal{A}\{(F \supset (A \circ B)) \vee G\}}{\mathcal{A}\{((F \supset A) \circ B) \vee G\}}\ {\circ\supset_2}}{\mathcal{A}\{(F \supset A) \circ (B \vee G)\}}\ {\circ\vee_1} \qquad \frac{\dfrac{\mathcal{A}\{F \supset ((A \circ B) \vee G)\}}{\mathcal{A}\{F \supset (A \circ (B \vee G))\}}\ {\circ\vee_1}}{\mathcal{A}\{(F \supset A) \circ (B \vee G)\}}\ {\circ\supset_2}$$

As before, the premises are not equiderivable. Resolving this ambiguity is going to be as hard as fully automated proof search, which will therefore not be recursively

<sup>2</sup> Our choice of connectives here has only negative polarity connectives except ∃ and ∨. In intuitionistic logic it is also possible to have a positive ∧ and atoms of both polarities [5,10], but this generality is not necessary for the present work.


**Fig. 4.** System Lni: rules for quantifiers and terms

solvable as soon as we introduce quantifiers. The subformula linking procedure needs further guidance from the user to resolve the ambiguity. A variant of this ambiguity can also be found in the original subformula linking work for classical linear logic [4]; there, the solution was to make the links *directed*. Then, whenever there is a choice to be made—which will necessarily have to be a choice between one subformula containing the *source* of the link and the other containing the *destination*—the procedure can choose to perform the rule corresponding to the *destination first*. In the above critical pair, for instance, if A contained the source and B the destination, then we would perform the ◦∨<sup>1</sup> step first (i.e., follow the left derivation). This choice is made to evoke the intuition that *the source is brought to the destination*; the context of the destination swallows the context of the source.

**Definition 7 (Directed Subformula Linking Procedure).** *We modify the procedure of definition 6 by making the links in step 2 directed, and in the resolution step 4 we break synchronous/synchronous ties for negatively signed rules by performing the rule for the destination first.*

#### **2.2 Quantifiers**

Extending Lnip with first-order quantifiers can be done in a number of ways. Here we present a parsimonious extension that avoids any up-front commitments with regard to the strength of the term language. Our terms (written s, t, . . .) have the following grammar:

$$s, t, \ldots ::= x \mid \mathsf{f} \cdot \vec{s}$$

where we write s⃗ to stand for a list of terms [s1, s2,...,sn]. We use x, y, . . . to range over variables and f, g,... to range over function symbols, and we abbreviate f⋅[] to f. We also extend atomic formulas: they are now written a⋅s⃗ where a is a predicate symbol, and we again abbreviate a⋅[] to a. To formulas and contexts we now add the two quantifiers, ∀ and ∃, to give the following extended grammars, where ∗ ∈ {∧, ∨} and Q ∈ {∀, ∃}.

$$\begin{aligned}
A, B, \ldots &::= \mathsf{a}\cdot\vec{s} \mid A \wedge B \mid \top \mid A \vee B \mid \bot \mid A \supset B \mid \forall x.\, A \mid \exists x.\, A \\
\mathcal{C}\{\} &::= \{\} \mid A \ast \mathcal{C}\{\} \mid \mathcal{C}\{\} \ast B \mid Qx.\, \mathcal{C}\{\} \mid A \supset \mathcal{C}\{\} \mid \mathcal{A}\{\} \supset B \\
\mathcal{A}\{\} &::= A \ast \mathcal{A}\{\} \mid \mathcal{A}\{\} \ast B \mid Qx.\, \mathcal{A}\{\} \mid A \supset \mathcal{A}\{\} \mid \mathcal{C}\{\} \supset B
\end{aligned}$$

We write C{t term} to assert that the term t is well-formed for the hole in C{}, i.e., all the (free) variables of t are bound by some quantifier whose scope contains the hole in C{}. We also write x # t or x # A to indicate that the variable x is not free in t or A respectively. Finally, the capture-avoiding substitution of t for x in a term u or formula A is written [t/x]u or [t/x]A respectively. The replacement C{A} of a formula into a context, on the other hand, is not capture-avoiding; instead, this replacement is considered to be well-formed whenever every free variable x of A has the property that C{x term}.

In order to give ourselves maximum freedom in the definition of the first-order extension, we will use the additional binary predicate symbol ≐ to denote equality. Given two lists of terms s⃗ = [s1,...,sn] and t⃗ = [t1,...,tn] of equal length, we will write s⃗ ≐ t⃗ to stand for (s1 ≐ t1) ∧ ⋯ ∧ (sn ≐ tn) if n > 0 and for ⊤ otherwise. Using this additional predicate, the terminal rule in of Lnip is modified to account for the term arguments.

**Definition 8 (System Lni).** *The system* Lni *is an extension of* Lnip *by removing the* in *rule of* Lnip *and adding the rules of fig. 4.*

**Theorem 9 (Completeness of Lni).** *If* ⊢ F *in a complete sequent calculus for first-order intuitionistic logic (e.g., G3i [11]) then* ⊤ $\xrightarrow{\mathsf{Lni}}$ F*.*

*Proof (Sketch).* We can follow the same strategy as for theorem 4. Note that for any term t, the rules refl and cong suffice to reduce C{t ≐ t} to C{⊤}. A transitivity rule for ≐ is not needed: no ≐ is created in a negatively signed context. ⊓⊔

*Example 10.* Two example Lni derivations are shown in fig. 5.


**Fig. 5.** Two example Lni derivations

# **3 Incorporating Arity-Typed** *λ***-Terms**

To make the calculus Lni of the previous section suitable to host a type theory as an object language, we will need to generalize from first-order terms to general λ-terms. We will follow a standard technique known variously as *higher-order abstract syntax* (HOAS) [12] or λ*-tree syntax* [7], which uses the *pure* λ-calculus—together with αβη-equality as its equational theory—to represent object languages. To keep things computable, we will use simply typed λ-terms with only one basic type, which is sometimes known as *arity typing*. Arity types (α, β, . . .) and terms (s, t, . . .) have the following grammar.

$$\alpha, \beta, \ldots ::= \star \mid \alpha \to \beta \qquad h ::= x \mid \mathsf{k} \qquad s, t, \ldots ::= h \cdot \vec{s} \mid \lambda x{:}\alpha.\, t$$

where x, y, . . . range over variables, and sans-serif identifiers such as k range over term constants. For formulas, we also change the quantifiers Qx. F to their arity typed forms Qx∶α. F, where Q ∈ {∀, ∃}.

We keep λ-terms in canonical *spine form*, where the head (h) of an application is identified and separated; in more usual notation, h⋅[s1,...,sn] would be written as the iterated application (⋯(h s1) ⋯ sn). The definition of substitution, [t/x]s, must be modified to retain spine forms, which is usually done by removing redexes on the fly; for example (using @ as an auxiliary operation):

$$\begin{aligned}
[t/x]\mathsf{k} &= \mathsf{k}\cdot[] \qquad [t/x]x = t \qquad [t/x]y = y\cdot[] \quad \text{(where } x \text{ and } y \text{ are different)} \\
[t/x](\lambda y{:}\alpha.\, s) &= \lambda y{:}\alpha.\, [t/x]s \\
[t/x](h \cdot [s_1, \ldots, s_n]) &= ([t/x]h) \mathbin{@} [[t/x]s_1, \ldots, [t/x]s_n] \\
(\lambda x{:}\alpha.\, s) \mathbin{@} [t_1, t_2, \ldots, t_n] &= ([t_1/x]s) \mathbin{@} [t_2, \ldots, t_n] \\
(h \cdot [s_1, \ldots, s_m]) \mathbin{@} [t_1, \ldots, t_n] &= h \cdot [s_1, \ldots, s_m, t_1, \ldots, t_n]
\end{aligned}$$
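To make the spine-form discipline concrete, here is a small Python sketch of [t/x]s and the auxiliary @ operation (the tuple encoding of terms is our own illustration, and it assumes bound variables are already renamed apart from x and from the free variables of t, so no capture check is performed):

```python
# Terms: ("lam", x, alpha, body) or ("app", head, spine), where head is
# ("var", x) or ("const", k) and spine is a Python list of terms.

def subst(t, x, s):
    """[t/x]s, keeping spine form; assumes bound variables are renamed apart."""
    if s[0] == "lam":
        _, y, alpha, body = s
        return ("lam", y, alpha, subst(t, x, body))
    _, (hkind, hname), spine = s
    spine2 = [subst(t, x, u) for u in spine]
    if hkind == "var" and hname == x:          # [t/x]x = t, then re-apply spine
        return apply_spine(t, spine2)
    return ("app", (hkind, hname), spine2)     # head k or y ≠ x is unchanged

def apply_spine(t, spine):
    """The auxiliary @ operation: apply t to a spine, removing redexes on the fly."""
    if not spine:
        return t
    if t[0] == "lam":                          # (λx:α. s) @ [t1, t2, ...]
        _, y, _alpha, body = t
        return apply_spine(subst(spine[0], y, body), spine[1:])
    _, h, spine0 = t                           # (h·[s...]) @ [t...] just appends
    return ("app", h, spine0 + spine)
```

For instance, substituting the identity λy:⋆. y·[] for x in x·[k·[]] produces k·[], with the redex removed on the fly.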

Most of the inference rules of system Lni generalize easily to this setting. The immediate differences will be with respect to the simplification rules. For the inst rule, we use a variant judgement C{t ∶ α} to mean that the λ-term t is well-typed at type α based on the type assumptions of its free variables that are bound in the scope of the hole in C{}. It is possible to view this judgement as being defined by inference rules; for instance (for Q ∈ {∀, ∃}):

$$\frac{}{\mathcal{C}\{Qx{:}\alpha.\, \{x : \alpha\}\}} \qquad \frac{\mathcal{C}\{\forall x{:}\alpha.\, \{t : \beta\}\}}{\mathcal{C}\{(\lambda x{:}\alpha.\, t) : \alpha \to \beta\}} \qquad \frac{\mathcal{C}\{h : \alpha_1 \to \cdots \to \alpha_n \to \beta\} \qquad \mathcal{C}\{s_i : \alpha_i\} \;\; (1 \leq i \leq n)}{\mathcal{C}\{(h \cdot [s_1, \ldots, s_n]) : \beta\}}$$
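Read bottom-up, these rules synthesize a type for each spine-form term; the following Python sketch does the same with an explicit environment in place of the context C{} (the encodings and names are ours, for illustration only):

```python
def arity_type(t, env):
    """Synthesize the arity type of a spine-form term t; env maps variable and
    constant names to types "*" or ("->", dom, cod). Raises TypeError on failure."""
    if t[0] == "lam":                          # λx:α. s  has type  α → β
        _, x, alpha, body = t
        return ("->", alpha, arity_type(body, {**env, x: alpha}))
    _, (_kind, name), spine = t                # h · [s1, ..., sn]
    ty = env[name]                             # type of the head h
    for s in spine:                            # peel one arrow per spine argument
        if ty == "*":
            raise TypeError("head applied to too many arguments")
        _, dom, ty = ty
        if arity_type(s, env) != dom:
            raise TypeError("argument type mismatch")
    return ty
```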

The rules refl and cong of Lni are replaced with:

$$\frac{\mathcal{C}\{\vec{s} \doteq \vec{t}\}}{\mathcal{C}\{h \cdot \vec{s} \doteq h \cdot \vec{t}\}}\ \mathrm{cong} \qquad \frac{\mathcal{C}\{\forall x{:}\alpha.\, \{s \doteq t\}\}}{\mathcal{C}\{(\lambda x{:}\alpha.\, s) \doteq (\lambda x{:}\alpha.\, t)\}}\ \mathrm{abs}$$

$$\frac{\mathcal{C}\{(\lambda x{:}\alpha.\, h \cdot [s_1, \ldots, s_n, x]) \doteq (\lambda x{:}\alpha.\, t)\}}{\mathcal{C}\{h \cdot [s_1, \ldots, s_n] \doteq (\lambda x{:}\alpha.\, t)\}}\ \eta\text{-exp} \qquad \text{(and its symmetric variant)}$$

**Definition 11 (System Lni**λ**).** *The system* Lni<sup>λ</sup> *is a modification of* Lni *with the* ▽ *rules,* cong*,* abs*,* <sup>η</sup>-exp*, and* in *above.*

**Theorem 12 (Completeness of Lni**λ**).** *For any formula* F *in the language of first-order logic over* λ*-terms but without any occurrence of* ≐*, if* ⊢ F *in a complete sequent calculus then* ⊤ $\xrightarrow{\mathsf{Lni}^\lambda}$ F*.*

*Proof (Sketch).* Once again, this is a straightforward extension of the proof of theorem 9. Since there are no occurrences of ≐ in F, and in particular no occurrence of it in a negatively signed context, the rules cong, abs and η-exp are sufficient to implement αβη-equivalence. ⊓⊔

#### **4 Application: Embedding Intuitionistic Type Theories**

The first-order language over arity-typed λ-terms of the previous section has enough expressive power for a complete encoding of any pure type system [6,15]. To keep things simple in this paper, we will demonstrate the case for LF (aka λΠ) using the *simple* embedding from [15]. Expressions in LF belong to one of the following three syntactic categories: *kinds*, *types*, or *terms*.

$$K ::= \mathtt{type} \mid \Pi x{:}A.\, K \tag{kinds}$$

$$A, B, \ldots ::= \mathsf{a}\ M_1 \cdots M_n \mid \Pi x{:}A.\, B \tag{types}$$

$$M, N, \ldots ::= x \mid \mathsf{k} \mid \lambda x{:}A.\, M \mid M\ N \tag{terms}$$

The LF type system is formally specified using inference rules in [9] and will not be repeated here. Instead, we will directly present a complete encoding of LF expressions using the language of Lniλ.

The encoding proceeds in two steps. First, we transform the dependently typed terms of LF into their simply typed forms, normalizing them as necessary. However, since LF terms can mention their types, we simultaneously transform LF types into simple types. This transformation erases not just the type dependencies but also the identities of the types by collapsing all of them to the same base type ⋆.

**Definition 13.** *The* forgetful map φ *specified below transforms LF terms into* Lniλ λ*-terms and LF types and kinds into* Lni<sup>λ</sup> *types.*

$$\begin{aligned}
\varphi(\mathsf{k}) &= \mathsf{k}\cdot[] & \varphi(x) &= x\cdot[] \\
\varphi(\lambda x{:}A.\, M) &= \lambda x{:}\varphi(A).\, \varphi(M) & \varphi(M\ N) &= \varphi(M) \mathbin{@} [\varphi(N)] \\
\varphi(\mathsf{a}\ M_1 \cdots M_n) &= \star & \varphi(\Pi x{:}A.\, B) &= \varphi(A) \to \varphi(B) \\
\varphi(\mathtt{type}) &= \star & \varphi(\Pi x{:}A.\, K) &= \varphi(A) \to \varphi(K)
\end{aligned}$$
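The forgetful map admits a direct Python transcription (the tuple encodings of LF expressions and spine-form λ-terms below are our own illustration; the redex case of φ(M N) is simplified away, and a full version would β-reduce on the fly as in the substitution definition of the previous section):

```python
# LF expressions (illustrative encoding): terms ("var", x), ("const", k),
# ("lam", x, A, M), ("app", M, N); types ("atom", a, args), ("pi", x, A, B);
# kinds ("type",) and ("pi", x, A, K). Output: arity types or spine-form terms.

def phi(e):
    tag = e[0]
    if tag == "var":   return ("app", ("var", e[1]), [])     # φ(x) = x·[]
    if tag == "const": return ("app", ("const", e[1]), [])   # φ(k) = k·[]
    if tag == "lam":                                         # φ(λx:A. M) = λx:φ(A). φ(M)
        _, x, A, M = e
        return ("lam", x, phi(A), phi(M))
    if tag == "app":                                         # φ(M N) = φ(M) @ [φ(N)]
        _, M, N = e
        fM = phi(M)
        assert fM[0] == "app", "sketch: β-reducing @ omitted for brevity"
        return ("app", fM[1], fM[2] + [phi(N)])
    if tag in ("atom", "type"):                              # φ(a M1 ⋯ Mn) = φ(type) = ⋆
        return "*"
    _, x, A, B = e                                           # tag == "pi"
    return ("->", phi(A), phi(B))                            # φ(Πx:A. B) = φ(A) → φ(B)
```

For example, the LF type Πx:a. b x collapses to the arity type ⋆ → ⋆, while λx:a. k x becomes the arity-typed term λx:⋆. k·[x].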

The second stage of the transformation recovers the information that was lost in the φ map by means of a single atomic predicate, has. Using this we define a mapping ⟦⟧ that transforms types and kinds to formulas in such a way that if M ∶ A holds then ⟦A⟧φ(M) is true.

**Definition 14.** *The mapping* ⟦⟧ *transforms an LF type/kind and a* Lni<sup>λ</sup> λ*-term into a* Lni<sup>λ</sup> *formula, specified recursively as follows.*

> ⟦a M1 ⋯ Mn⟧m = has⋅[m, a⋅[φ(M1),...,φ(Mn)]]
> ⟦type⟧m = has⋅[m, type]
> ⟦Πx∶A. J⟧m = ∀x∶φ(A). ⟦A⟧x ⊃ ⟦J⟧(m @ [x])

*(where* J *can be an LF type or kind).*
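To see the clauses of definition 14 in action, here is a small string-level Python sketch (entirely our own illustrative encoding: atomic-type arguments are taken as already-φ'd strings, m is kept as a head with an explicit spine so that m @ [x] just appends, and the x⋅[] abbreviation is not applied):

```python
def phi(J):
    """φ restricted to LF types/kinds, rendered as a string."""
    if J[0] in ("atom", "type"):
        return "*"
    _, x, A, K = J                             # ("pi", x, A, K)
    return f"({phi(A)} -> {phi(K)})"

def spine(head, args):
    return f"{head}·[{', '.join(args)}]"

def encode(J, m):
    """⟦J⟧m, where m = (head, args) is a spine-form term."""
    head, args = m
    if J[0] == "atom":                         # ⟦a M1 ⋯ Mn⟧m = has·[m, a·[...]]
        _, a, phiMs = J
        return f"has·[{spine(head, args)}, {spine(a, phiMs)}]"
    if J[0] == "type":                         # ⟦type⟧m = has·[m, type]
        return f"has·[{spine(head, args)}, type]"
    _, x, A, K = J                             # ⟦Πx:A. K⟧m = ∀x:φ(A). ⟦A⟧x ⊃ ⟦K⟧(m @ [x])
    return (f"∀{x}:{phi(A)}. ({encode(A, (x, []))}) ⊃ "
            f"{encode(K, (head, args + [x]))}")
```

On the type Πx:a. b x with witness c, this prints the expected ∀x:⋆. has⋅[x, a] ⊃ has⋅[c⋅[x], b⋅[x]] (modulo the unabbreviated x⋅[]).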

**Proposition 15 (Completeness [15]).** *If the judgement* x1∶J1,...,xn∶Jn ⊢ M ∶ A *is derivable in LF [9], then the following formula is provable in* Lniλ*:* ∀x1∶φ(J1). ⟦J1⟧(x1⋅[]) ⊃ ⋯ ⊃ ∀xn∶φ(Jn). ⟦Jn⟧(xn⋅[]) ⊃ ⟦A⟧φ(M)*.* ⊓⊔

The converse of proposition 15 does not necessarily hold, since the forgetful map φ is not injective.<sup>3</sup> In particular, since the encoding of atomic types forgets the term arguments, we have that φ(λx∶A1. s) = φ(λx∶A2. s) whenever φ(A1) = φ(A2); however, the latter does not guarantee that A1 = A2. Thus, ⟦Πx∶A1. B⟧φ(λx∶A2. s) may hold even when A1 ≠ A2. To avoid this ambiguity, we must use the *canonical LF* variant of the LF type theory where the type ascription on λ is omitted and the type system is made bidirectional [19]; this guarantees that only Π-types will ascribe types to bound variables, removing the issue highlighted above.

<sup>3</sup> This issue, pointed out in [16], is a mistake in earlier papers such as [6,15].


**Fig. 6.** <sup>A</sup> Lni<sup>λ</sup> derivation of an embedded LF type (example 16). Some type ascriptions are elided, and doubled lines denote simplifications.

*Example 16.* Consider the following LF type A ≜ Πu∶a.Πz∶(Πx∶a. b x). b u. By definition 14, we have:

$$\begin{aligned}
\llbracket A \rrbracket k = {} & \forall u{:}\star.\ \mathsf{has}\cdot[u, \mathsf{a}] \supset {} \\
& \forall z{:}\star \to \star.\ (\forall x{:}\star.\ \mathsf{has}\cdot[x, \mathsf{a}] \supset \mathsf{has}\cdot[z\cdot[x], \mathsf{b}\cdot[x]]) \supset {} \\
& \mathsf{has}\cdot[k, \mathsf{b}\cdot[u]].
\end{aligned}$$

Fig. 6 has an example Lniλ derivation of this formula where k is existentially quantified. As usual, highlights indicate the two links the user drew for the two ▹ rules. The derivation can be completed with the instantiation [z⋅[u]/k]; this means that the LF type A is inhabited by some LF term M for which φ(M) = z⋅[u].

Note that it is not a problem that the Lniλ derivation has not discovered an LF term for k. Given a Lniλ term k for which ⟦A⟧k is derivable, it is possible to find a term M for which φ(M) = k and M ∶ A holds in LF. One way to do this would be to use *bidirectional type checking* [14,19] to recreate the missing LF types deterministically.

While the encoding of LF in Lniλ suffices to implement the proof by linking technique, it is a leaky encoding. As the derivation in fig. 6 proceeds, the conjecture resembles the image of the ⟦⟧ map less and less; in particular, the conjecture starts to accumulate things that are not fundamentally present in the LF type system, such as term equations, conjunctions, and existential quantifiers. The purported novice user mentioned in the introduction thus needs to be familiar with at least two languages: LF and (a somewhat esoteric variant of) first-order logic. One way to improve matters would be to try to define the linking procedure directly on the LF type system, but this example seems to indicate that the LF language is not expressive enough to capture all the structures that will occur when resolving a link. At the very least, it seems that some kind of pairing construct—i.e., Σ-types—is essential. Moreover, to capture free floating has assumptions, the language of LF might need to be extended further with judgemental expressions of the form ⟨M∶A⟩.

## **5 Conclusion and Future Directions**

We have presented a formal system of *proof by linking* for intuitionistic logic and a derived system for the dependent type theory LF. We are currently in the process of implementing this system as a variant of the *Profound* tool, which was initially developed for classical linear logic in [4].

In order for this system to be usable in a general purpose interactive theorem prover based on first-order logic (such as Abella [2]) or dependent type theory (such as Twelf [13]), the most important missing ingredient is support for inductive definitions and reasoning by induction. The first step in a proof by structural induction is to indicate which assumption(s) will drive the analysis, which is closer to a *pointing* than a *linking*. Thus, proof by linking and pointing will need to co-exist.

A further improvement that would be made as a matter of course in an implementation would be the use of a unification engine to remove the clutter of ≐ formulas. It is worth investigating (in future work) if the linking metaphor can also be used for algebraic operations on terms based on ≐. In many systems ≐-assumptions can be used to rewrite terms, which is readily incorporated into the linking scheme: just link a term to one side of a ≐. We can in fact see it as variants of the inst rule:

$$\frac{\mathcal{C}\{[t/x]\mathcal{C}'\{\top\}\}}{\mathcal{C}\{\exists x.\, \mathcal{C}'\{x \doteq t\}\}} \qquad \frac{\mathcal{A}\{[t/x]\mathcal{A}'\{\top\}\}}{\mathcal{A}\{\forall x.\, \mathcal{A}'\{x \doteq t\}\}}$$

It is worth investigating if such variants of inst can make the embedding of LF into Lniλ less leaky.

Note that proof by linking, like proof by pointing, can easily be incorporated as a tactic in an existing proof system. After all, each of the inference rules of Lniλ is logically motivated, and can therefore be established as a certifying tactic. The quality of the formal proof terms produced in this way will be poor since most proof term languages are not designed for deep rewriting – indeed, the proof term for each Lniλ inference rule may have a size that is exponential in that of the conjecture. It is perhaps better to see proof by linking as a *proof exploration* tool for quickly testing out logical properties of a conjecture before attempting a traditional structured proof. In the hands of an expert user, this exploration mode can also help to discover useful lemmas to bridge the gap between an existing collection of proved theorems and a desired target theorem.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Efficient SAT-based Proof Search in Intuitionistic Propositional Logic**

Camillo Fiorentini

Department of Computer Science, Università degli Studi di Milano, Milan, Italy

**Abstract.** We present an efficient proof search procedure for Intuitionistic Propositional Logic which involves the use of an incremental SAT-solver. Basically, it is obtained by adding a restart operation to the system intuit by Claessen and Rosén, thus we call our implementation intuitR. We gain some remarkable advantages: derivations have a simple structure; countermodels are in general small; using a standard benchmark suite, we outperform intuit and other state-of-the-art provers.

## **1 Introduction**

The intuit theorem prover by Claessen and Rosén [2] implements an efficient decision procedure for Intuitionistic Propositional Logic (IPL) based on a Satisfiability Modulo Theories (SMT) approach. Given an input formula α, the clausification module of intuit computes a sequent σ = R, X ⇒ g equivalent to α with respect to IPL-validity, where R, X and g have a special form: R is a set of clauses, X is a set of implications (a → b) → c, with a, b, c atoms, and g is an atom. The decision procedure at the core of intuit searches for a Kripke model K such that at its root all the formulas in R and X are forced and g is not forced; we call K a countermodel for σ, since it witnesses the non-validity of σ in IPL. The search is performed via a proper variant of the DPLL(T) procedure [12], whose top-level loop exploits an incremental SAT-solver. This leads to a highly performant decision strategy; actually, on the basis of a standard benchmark suite, intuit outperforms two of the state-of-the-art provers for IPL, namely fCube [5] and intHistGC [11]. At first sight, the intuit decision procedure seems to be far away from the traditional techniques for deciding IPL validity; on the other hand, the in-depth investigation presented in [10] unveils a close and surprising connection between the intuit approach based on SMT and the known proof-theoretic methods. The crucial point is that the main loop of the decision procedure mimics a standard root-first proof search strategy for the sequent calculus LJTSAT [10] (see Fig. 7), a variant of Dyckhoff's calculus LJT [3]. In [10] the intuit decision procedure is re-formulated so that, given a sequent σ, it outputs either a derivation of σ in LJTSAT or a countermodel for σ.

Here we continue this investigation to better exploit the interplay between the SMT perspective and proof-theoretic methods. As a first step, we enhanced the Haskell intuit code<sup>1</sup> by implementing the derivation/countermodel

<sup>1</sup> Available at https://github.com/koengit/intuit.

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 217–233, 2021. https://doi.org/10.1007/978-3-030-79876-5_13

extraction procedures discussed in [10]. We observed some unexpected phenomena: derivations are often convoluted and contain applications of the cut rule which cannot be trivially eliminated; countermodels in general contain many redundancies. To overcome these issues, we have redesigned the decision procedure. Unlike intuit, in the main loop we keep all the worlds of the countermodel under construction. Whenever the generation of a new world fails, the current model is emptied and the computation restarts with a new iteration of the main loop. We call the resulting prover intuitR (intuit with Restart). We gain some remarkable advantages. Firstly, the proof search procedure has a plain and intuitive presentation, consisting of two nested loops (see the flowchart in Fig. 3). Secondly, derivations have a linear structure, formalized by the calculus C→ in Fig. 1; basically, a derivation in C→ is a cut-free derivation in LJTSAT having only one branch. Thirdly, the countermodels obtained by intuitR are in general smaller than the ones obtained by intuit, since restarts cross out redundant worlds. We have replicated the experiments in [2] (1200 benchmarks): as reported in the table in Fig. 9 and in the scatter plot in Fig. 11, intuitR performs better than intuit. The intuitR implementation and other additional material (e.g., the omitted proofs, a detailed report on experiments) can be downloaded at https://github.com/cfiorentini/intuitR.

# **2 Preliminary Notions**

Formulas, denoted by lowercase Greek letters, are built from an infinite set of propositional variables V, the constant ⊥ and the connectives ∧, ∨, →; the formula α ↔ β stands for (α → β) ∧ (β → α). Elements of the set V ∪ {⊥} are called *atoms* and are denoted by lowercase Roman letters; uppercase Greek letters denote sets of formulas. A *(classical) interpretation* M is a subset of V, identifying the propositional variables assigned to true. By M |= α we mean that α is true in M; moreover, M |= Γ iff M |= α for every α ∈ Γ. We write Γ ⊢c α iff, for every interpretation M, M |= Γ implies M |= α. A formula α is CPL-valid (valid in Classical Propositional Logic) iff ∅ ⊢c α.
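Since only a handful of variables occur at clause level, Γ ⊢c α can even be checked by brute-force enumeration of interpretations; the following Python sketch does exactly that (a toy encoding of our own, not the intuit implementation, which delegates such checks to an incremental SAT-solver):

```python
from itertools import product

def atoms(f):
    """All atoms of a formula; atoms are strings, compounds are ("and"|"or"|"imp", f, g)."""
    return {f} if isinstance(f, str) else atoms(f[1]) | atoms(f[2])

def ev(f, m):
    """Truth value of f in the interpretation m ⊆ V; the atom "⊥" is always false."""
    if isinstance(f, str):
        return f != "⊥" and f in m
    op, a, b = f
    if op == "and": return ev(a, m) and ev(b, m)
    if op == "or":  return ev(a, m) or ev(b, m)
    return (not ev(a, m)) or ev(b, m)          # op == "imp"

def entails_c(gamma, alpha):
    """Γ ⊢c α: no interpretation makes all of Γ true and α false."""
    vs = sorted(set().union(*[atoms(f) for f in list(gamma) + [alpha]]) - {"⊥"})
    return not any(all(ev(g, m) for g in gamma) and not ev(alpha, m)
                   for bits in product([False, True], repeat=len(vs))
                   for m in [{v for v, b in zip(vs, bits) if b}])
```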

A (rooted) Kripke model for IPL (Intuitionistic Propositional Logic) is a quadruple K = ⟨W, ≤, r, ϑ⟩ where W is a finite and non-empty set (the set of *worlds*), ≤ is a reflexive and transitive binary relation over W, the world r (the *root* of K) is the minimum of W w.r.t. ≤, and ϑ : W → 2^V (the *valuation* function) is a map obeying the persistence condition: for every pair of worlds w1 and w2 of K, w1 ≤ w2 implies ϑ(w1) ⊆ ϑ(w2). The valuation ϑ is extended into a *forcing* relation ⊩ between worlds and formulas as follows:

$$\begin{array}{ll}
w \Vdash p \text{ iff } p \in \vartheta(w), \text{ for every } p \in V & \quad w \nVdash \bot \\
w \Vdash \alpha \land \beta \text{ iff } w \Vdash \alpha \text{ and } w \Vdash \beta & \quad w \Vdash \alpha \lor \beta \text{ iff } w \Vdash \alpha \text{ or } w \Vdash \beta \\
\multicolumn{2}{l}{w \Vdash \alpha \to \beta \text{ iff } \forall w' \geq w, \ w' \Vdash \alpha \text{ implies } w' \Vdash \beta.}
\end{array}$$

By w ⊩ Γ we mean that w ⊩ α for every α ∈ Γ. A formula α is IPL-valid iff, for every Kripke model K, we have r ⊩ α (here and below r designates the root of K). Thus, if there exists a model K such that r ⊮ α, then α is not IPL-valid; we call K a *countermodel* for α, written K ⊭ α, and we say that α is *counter-satisfiable*. We write Γ ⊢i δ iff, for every model K, r ⊩ Γ implies r ⊩ δ; thus,
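The forcing clauses translate directly into a checker over a finite model. In the following Python sketch (our own encoding: le is the relation ≤ as a set of pairs and theta is the valuation ϑ), the standard two-world model with ϑ(0) = ∅ ≤ ϑ(1) = {p} is a countermodel for p ∨ ¬p, even though it forces ¬¬p at the root:

```python
def forces(w, f, le, theta):
    """w ⊩ f in the Kripke model ⟨W, ≤, r, ϑ⟩; formulas are atoms (strings) or
    ("and"|"or"|"imp", f, g), with ¬α taken as α → ⊥."""
    if isinstance(f, str):
        return f != "⊥" and f in theta[w]      # persistence makes this monotone
    op, a, b = f
    if op == "and": return forces(w, a, le, theta) and forces(w, b, le, theta)
    if op == "or":  return forces(w, a, le, theta) or forces(w, b, le, theta)
    # op == "imp": quantify over every w' with w ≤ w'
    return all(not forces(v, a, le, theta) or forces(v, b, le, theta)
               for (u, v) in le if u == w)
```

This mirrors why excluded middle fails intuitionistically: at the root, p is not yet forced, but a future world forces it, so p → ⊥ fails as well.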

$$\frac{R \vdash_c g}{R,\, X \Rightarrow g}\ \mathrm{cpl}_0 \qquad \frac{R, A \vdash_c b \qquad R, \varphi, X \Rightarrow g}{R,\, X \Rightarrow g}\ \mathrm{cpl}_1 \qquad \begin{array}{l} (a \to b) \to c \in X \\ A \subseteq V \\ \varphi = \bigwedge(A \setminus \{a\}) \to c \end{array}$$

**Fig. 1.** The sequent calculus C→; R, X ⇒ g is an r-sequent.

α is IPL-valid iff ∅ ⊢i α. Let σ be a sequent of the form Γ ⇒ δ; σ is IPL-valid iff Γ ⊢i δ. By K ⊭ σ we mean that r ⊩ Γ and r ⊮ δ. Note that such a model K witnesses that σ is not IPL-valid; we say that K is a *countermodel* for σ and that σ is *counter-satisfiable*.

*Clausification* We review the main concepts about the clausification procedure described in [2]. *Flat clauses* ϕ and *implication clauses* λ are defined as

$$\begin{array}{ll}
\varphi ::= \bigwedge A_1 \to \bigvee A_2 \mid \bigvee A_2 & \quad \emptyset \subset A_k \subseteq V \cup \{\bot\}, \text{ for } k \in \{1, 2\} \\
\lambda ::= (a \to b) \to c & \quad a \in V, \ \{b, c\} \subseteq V \cup \{\bot\}
\end{array}$$

where ⋀A_1 and ⋁A_2 denote the conjunction and the disjunction of the atoms in A_1 and A_2 respectively (⋀{a} = ⋁{a} = a). Henceforth, ⋀∅ → ⋁A_2 must be read as ⋁A_2; moreover, R, R_1, … denote sets of flat clauses; X, X_1, … sets of implication clauses; A, A_1, … sets of atoms. The intuit procedure relies on the following property (see Lemma 2 in [10]):

**Lemma 1.** *For every set of flat clauses* R *and every atom* g, R ⊢_i g *iff* R ⊢_c g.

In the decision procedure, flat clauses are actively used only in classical reasoning. A pair (R, X) is →-*closed* iff, for every (a → b) → c ∈ X, b → c ∈ R. An *r-sequent* (reduced sequent) is a sequent Γ ⇒ g where g is an atom, Γ = R ∪ X and (R, X) is →-closed. Given a formula α, the clausification procedure yields a triple (R, X, g) such that R, X ⇒ g is an r-sequent and:

(1) ⊢_i α iff R, X ⊢_i g; (2) K ⊭ R, X ⇒ g implies K ⊭ α, for every K.²

Thus, IPL-validity of formulas can be reduced to IPL-validity of r-sequents.
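The →-closure required of r-sequents is a one-pass operation. As a small illustration (our own encoding, not the paper's: flat clauses as (premises, conclusions) pairs of frozensets, implication clauses as ((a, b), c) triples):

```python
def arrow_closure(R, X):
    """Close (R, X): for every (a -> b) -> c in X, add the flat clause b -> c to R.
    A flat clause (A1, A2) stands for /\A1 -> \/A2."""
    return set(R) | {(frozenset({b}), frozenset({c})) for (a, b), c in X}

# The implication clause (a -> b) -> c contributes b -> c:
X = {(("a", "b"), "c")}
R = arrow_closure(set(), X)
assert (frozenset({"b"}), frozenset({"c"})) in R
```

The added clauses b → c are intuitionistic consequences of the corresponding implication clauses, so closing (R, X) does not change the validity status of the sequent.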

## **3 The Calculus** *C→*

The sequent calculus C→ consists of the rules cpl_0 and cpl_1 from Fig. 1. Rule cpl_0 (the axiom rule) can only be applied if the condition R ⊢_c g holds; rule cpl_1 requires that R, A ⊢_c b holds. In rule cpl_1, (a → b) → c is the *main formula* and A the set of *local assumptions*; note that A is any set of propositional variables (not necessarily containing a). Derivations are defined as usual (see e.g. [14]);

² In [2] the clausification procedure outputs a triple (R, X, g) satisfying (1) and (2); the →-closure of (R, X) is performed at the beginning of the decision procedure (for every (a → b) → c ∈ X, the clause b → c is added to R).

$$\begin{array}{c}
\dfrac{R\_{m-1},\ A\_{m-1} \vdash\_c b\_{m-1} \qquad \dfrac{R\_m \vdash\_c g}{R\_m,\ X \Rightarrow g}}{R\_{m-1},\ X \Rightarrow g}\ \lambda\_{m-1} \\[1ex]
\vdots \\[1ex]
\dfrac{R\_0,\ A\_0 \vdash\_c b\_0 \qquad R\_1,\ X \Rightarrow g}{R\_0,\ X \Rightarrow g}\ \lambda\_0 \\[2ex]
\lambda\_k = (a\_k \to b\_k) \to c\_k \in X, \qquad \varphi\_k = \bigwedge(A\_k \setminus \{a\_k\}) \to c\_k, \qquad R\_{k+1} = R\_k \cup \{\varphi\_k\}
\end{array}$$

**Fig. 2.** Derivation of R_0, X ⇒ g in C→ (0 ≤ k ≤ m − 1).

by ⊢_{C→} σ we mean that there exists a derivation of the r-sequent σ in C→. In showing derivations, we leave out rule names and we display the main formulas of cpl_1 applications. Soundness of rule cpl_1 relies on the following property:

$$\text{(a) If } R, A \vdash\_c b \text{, then } R, (a \to b) \to c \vdash\_i \varphi \text{, where } \varphi = \bigwedge(A \setminus \{a\}) \to c.$$

Indeed, let R, A ⊢_c b. By Lemma 1, R, A ⊢_i b, thus R, A \ {a} ⊢_i a → b. It follows that R, (a → b) → c, A \ {a} ⊢_i c, hence R, (a → b) → c ⊢_i φ. By Lemma 1 and (a), the soundness of C→ follows:

**Proposition 1.** ⊢_{C→} R, X ⇒ g *implies* R, X ⊢_i g.

A derivation of σ_0 = R_0, X ⇒ g has the plain form shown in Fig. 2: it only contains the branch of sequents σ_k = R_k, X ⇒ g, where the sets R_k are increasing. Nevertheless, the design of a root-first proof-search strategy for C→ is not obvious. Let σ_0 be the r-sequent to be proved; we try to build the derivation in Fig. 2 bottom-up by running a loop where, at each iteration k ≥ 0, we search for a derivation of σ_k. It is convenient to first check whether R_k ⊢_c g, so that, by applying rule cpl_0, we immediately get a derivation of σ_k. If this is not the case, we should pick an implication λ_k from X and guess a proper set of local assumptions A_k in order to apply rule cpl_1 bottom-up.

$$\frac{R\_k,\ b\_k \vdash\_c b\_k \qquad R\_k,\ X \Rightarrow g}{R\_k,\ X \Rightarrow g}\ \lambda\_k \qquad \begin{array}{l} \lambda\_k = (a\_k \to b\_k) \to c\_k \in X, \ b\_k \to c\_k \in R\_k \\ A\_k = \{b\_k\}, \ \varphi\_k = b\_k \to c\_k, \ R\_{k+1} = R\_k \end{array}$$

If we followed a blind choice, the procedure would be highly inefficient; for instance, the application of rule cpl_1 shown on the left triggers a non-terminating loop. Instead, we pursue this strategy: we search for a countermodel for σ_k; if we succeed, then R_k, X ⊬_i g and, since R_0 ⊆ R_k, we conclude that R_0, X ⊬_i g and proof search ends. Otherwise, from the failure we learn the proper λ_k and A_k to be used in the application of rule cpl_1; at the next iteration, proof search restarts with the sequent σ_{k+1}, where R_{k+1} is obtained by adding the learned clause φ_k to R_k. To check classical provability, we exploit a SAT-solver; each time the solver is invoked, the set R_k has grown, thus it is advantageous to use an incremental SAT-solver.

*Countermodels* Henceforth we define Kripke models by specifying the interpretations associated with their worlds. Let W be a finite set of interpretations with minimum M_0, namely: M_0 ⊆ M for every M ∈ W. By K(W) we denote the Kripke model ⟨W, ≤, M_0, ϑ⟩ where ≤ coincides with the subset relation ⊆ and ϑ is the identity map; thus M ⊩ p (in K(W)) iff p ∈ M. We introduce the following *realizability relation* ⊳_W between W and implication clauses:

$$\begin{array}{l} M \rhd\_W (a \to b) \to c \quad \text{iff} \quad (a \in M) \text{ or } (b \in M) \text{ or } (c \in M) \text{ or } \\ \quad (\exists M' \in W \text{ s.t. } M \subset M' \text{ and } a \in M' \text{ and } b \notin M'). \end{array}$$

By M ⊳_W X we mean that M ⊳_W λ for every λ ∈ X. Countermodels of r-sequents can be characterized as follows:

**Proposition 2.** *Let* σ = R, X ⇒ g *be an r-sequent and let* W *be a finite set of interpretations with minimum* M_0. *Then,* K(W) ⊭ σ *iff: (i)* g ∉ M_0*; (ii) for every* M ∈ W, M |= R *and* M ⊳_W X.
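Proposition 2 turns countermodel checking into a purely combinatorial test. The following Python sketch is our own (clause encodings and function names are assumptions); it checks conditions (i) and (ii) on the two-world countermodel for Peirce's law ((a → b) → a) → a:

```python
def realizes(M, lam, W):
    """The realizability relation of the paper: M realizes (a -> b) -> c in W."""
    (a, b), c = lam
    return (a in M or b in M or c in M or
            any(M < M2 and a in M2 and b not in M2 for M2 in W))

def is_countermodel(W, R, X, g):
    """Conditions (i)-(ii) of Prop. 2; flat clauses (A1, A2) are checked classically."""
    M0 = min(W, key=len)
    sat = lambda M, cl: not cl[0] <= M or bool(cl[1] & M)
    return (all(M0 <= M for M in W) and g not in M0 and
            all(all(sat(M, cl) for cl in R) and
                all(realizes(M, lam, W) for lam in X) for M in W))

# r-sequent for Peirce's law:  b -> a, (a -> b) -> a  =>  a
W = [frozenset(), frozenset({"a"})]           # K(W): root {} below {a}
R = {(frozenset({"b"}), frozenset({"a"}))}    # the ->-closure clause b -> a
X = {(("a", "b"), "a")}
assert is_countermodel(W, R, X, "a")          # Peirce's law is not IPL-valid
```

At the root, the clause (a → b) → a is realized through the larger world {a}, which contains a but not b; at {a} it is realized locally because a is already present.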

## **4 The Procedure** proveR

The strategy outlined in Sec. 3 is implemented by the decision procedure proveR (prove with Restart), defined by the flowchart in Fig. 3. The call proveR(R, X, g) returns Valid if the r-sequent σ = R, X ⇒ g is IPL-valid, and CountSat otherwise; by tracing the computation, we can build a C→-derivation of σ in the former case, and a countermodel for σ in the latter. We exploit a single incremental SAT-solver s: clauses can be added to s but not removed; by R(s) we denote the set of clauses stored in s. The solver s has an associated set of propositional variables U(s) (the universe of s); we assume that every clause φ supplied to s is built over U(s) (namely, every variable occurring in φ belongs to U(s)). The SAT-solver is required to support the following operations:

	- Create a new SAT-solver.
	- addClause(s, φ): add the clause φ to s.
	- satProve(s, A, g): check whether R(s), A ⊢_c g; the answer is either Yes(A′) or No(M):
	- Yes(A′): thus, A′ ⊆ A and R(s), A′ ⊢_c g;
	- No(M): thus, A ⊆ M ⊆ U(s) and M |= R(s) and g ∉ M.

In the former case it follows that R(s), A ⊢_c g; in the latter, R(s), A ⊬_c g.

The procedure newSolver(R), defined using the primitive operations, creates a new SAT-solver containing all the clauses in R. The computation of the call proveR(R, X, g) consists of the following steps:


**Fig. 3.** Computation of proveR(R, X, g).
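Since the flowchart of Fig. 3 is not reproduced here, the loop structure can be sketched as follows. This is our own naive re-implementation of the restart strategy described in the text, not the authors' code: the SAT calls are brute-forced over all interpretations, the Yes answers are not minimized, and the clause encoding and function names are ours.

```python
from itertools import combinations

def subsets(universe):
    """All subsets of `universe`, smallest first (so No answers are minimal-ish)."""
    xs = sorted(universe)
    return [frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]

def sat_flat(M, cl):
    """M |= /\A1 -> \/A2, classically."""
    A1, A2 = cl
    return not (A1 <= M) or bool(A2 & M)

def sat_prove(R, U, A, g):
    """Brute-force stand-in for satProve: look for M with A <= M <= U,
    M |= R and g not in M; answer ("No", M) or ("Yes", A) (no minimization)."""
    for M in subsets(U):
        if A <= M and g not in M and all(sat_flat(M, cl) for cl in R):
            return ("No", M)
    return ("Yes", A)

def realizes(M, lam, W):
    """Realizability relation of Sec. 3."""
    (a, b), c = lam
    return (a in M or b in M or c in M or
            any(M < M2 and a in M2 and b not in M2 for M2 in W))

def proveR(R, X, g):
    # ->-closure, then collect the universe of atoms
    R = set(R) | {(frozenset({b}), frozenset({c})) for (a, b), c in X}
    U = ({g} | {p for A1, A2 in R for p in A1 | A2}
             | {p for (a, b), c in X for p in (a, b, c)})
    while True:                                   # main loop, restart on learning
        res, M = sat_prove(R, U, frozenset(), g)  # step (S2)
        if res == "Yes":
            return "Valid"
        W = {M}
        while True:                               # inner loop: grow the model
            pair = next(((w, lam) for w in W for lam in X
                         if not realizes(w, lam, W)), None)
            if pair is None:                      # step (S4): every clause realized
                return "CountSat"                 # K(W) is a countermodel
            w, ((a, b), c) = pair
            res, M = sat_prove(R, U, w | {a}, b)  # step (S5)
            if res == "No":
                W.add(M)                          # a new world, strictly above w
            else:                                 # learn phi = /\(A \ {a}) -> c
                R.add((frozenset(M - {a}), frozenset({"c"} and {c})))
                break                             # restart the main loop
```

For instance, on the r-sequent of Peirce's law (R = ∅, X = {(a → b) → a}, goal a) the sketch answers CountSat, while on a → b, (a → b) → c ⇒ c it learns one clause, restarts once, and answers Valid.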


Note that during the computation no new variables are created, thus U(s) can be defined as the set of propositional variables occurring in R ∪ X ∪ {g}. We show that the call proveR(R, X, g) is correct, namely: if R, X, g match the Input Assumptions, then the Output Properties hold (see Fig. 3). We stipulate that:


We prove some properties about the computation of proveR(R, X, g); property (P1) consists of points (i)-(iii) below.

	- (i) The set W_{k,j} has a minimum element M_0 and g ∉ M_0.
	- (ii) For every M ∈ W_{k,j}, M |= R_k.
	- (iii) If W_{k,j+1} is defined, then W_{k,j} ⊂ W_{k,j+1}.

Let W_{k,0} = {M}; one can easily check that, setting M_0 = M, (i) holds. Point (ii) follows from the fact that each M in W_{k,j} comes from an answer No(M), thus M |= R_k. Let W_{k,j+1} be defined and let W_{k,j+1} = W_{k,j} ∪ {M}, with M computed at step (S5); there are w ∈ W_{k,j} and λ = (a → b) → c ∈ X such that w ⋫_{W_{k,j}} λ and w ∪ {a} ⊆ M and b ∉ M. We cannot have M ∈ W_{k,j}, otherwise, since w ⊆ M and a ∈ M and b ∉ M, we would get w ⊳_{W_{k,j}} λ, a contradiction. Thus M ∉ W_{k,j}, and this proves (iii).

Let 0 ≤ h < k be such that φ_h and φ_k are defined, and let ⟨w_k, λ_k = (a_k → b_k) → c_k⟩ and A_k be the pair and the assumptions learned at iteration k respectively; note that A_k ⊆ w_k ∪ {a_k}. Since R_h ∪ {φ_h} = R_{h+1} ⊆ R_k, we have φ_h ∈ R_k; by (P1)(ii), it holds that w_k |= R_k, hence w_k |= φ_h. We show that w_k ⊭ φ_k, and this proves (P2). Since ⟨w_k, λ_k⟩ has been selected at Step (S4), c_k ∉ w_k; by the fact that φ_k = ⋀(A_k \ {a_k}) → c_k and A_k \ {a_k} ⊆ w_k, we conclude w_k ⊭ φ_k.

Exploiting the above properties, we prove the correctness of proveR, also showing how to extract derivations and countermodels from computations.

**Proposition 3.** *The call* proveR(R, X, g) *is correct.*

*Proof.* We start by proving that the computation never diverges. By (P2), the learned clauses φ_k are pairwise not classically equivalent; since each φ_k is built over the finite set U(s), at most 2^{|U(s)|} such clauses can be generated, and this proves the termination of the main loop. Since every interpretation M in W is a subset of U(s), by (P1)(iii) the termination of the inner loop follows.

Let σ = R, X ⇒ g. If proveR(R, X, g) returns CountSat, then the computation ends at Step (S4) since no pair ⟨w, λ⟩ can be selected. By (P1), the current set W satisfies conditions (i) and (ii) of Prop. 2; accordingly, K(W) is a countermodel for σ, thus R, X ⊬_i g. If proveR(R, X, g) outputs Valid, then there exists m ≥ 0 such that, at Step (S2) of iteration m of the main loop, the SAT-solver yields Yes(∅), hence R_m ⊢_c g. For every iteration k in 0 … m − 1 of the main loop, let ⟨w_k, λ_k = (a_k → b_k) → c_k⟩ be the learned pair and A_k the learned assumptions (thus, R_k, A_k ⊢_c b_k). We can apply rule cpl_1 as follows:

$$\frac{R\_k,\ A\_k \vdash\_c b\_k \qquad R\_{k+1},\ X \Rightarrow g}{R\_k,\ X \Rightarrow g}\ \lambda\_k \qquad \begin{array}{l} \varphi\_k = \bigwedge(A\_k \setminus \{a\_k\}) \to c\_k \\ R\_0 = R, \quad R\_{k+1} = R\_k \cup \{\varphi\_k\} \end{array}$$

Accordingly, we can build the derivation of R, X ⇒ g displayed in Fig. 2 and, by Prop. 1, we conclude R, X ⊢_i g.

As a corollary, we get the completeness of the calculus C→:

**Proposition 4.** *For every r-sequent* σ = R, X ⇒ g, ⊢_{C→} σ *iff* R, X ⊢_i g.

We give two examples of computations using formulas from the ILTP (Intuitionistic Logic Theorem Proving) library [13].

*Example 1.* Let χ be the first instance of problem class SYJ201 from the ILTP library [13], where η_ij = p_i ↔ p_j and γ = p_1 ∧ p_2 ∧ p_3:

$$\chi = ( (\eta\_{12} \to \gamma) \land (\eta\_{23} \to \gamma) \land (\eta\_{31} \to \gamma) ) \to \gamma$$

The clausification of χ yields the triple (R_0, X, g̃), where X contains the implication clauses λ_0, …, λ_5 defined in Fig. 4 and R_0 the following 17 clauses (we mark with a tilde the fresh variables introduced during clausification):³

$$\begin{array}{l}
\tilde{p}\_0 \to \tilde{p}\_4, \quad \tilde{p}\_3 \to p\_2, \quad \tilde{p}\_3 \to p\_3, \quad \tilde{p}\_4 \to p\_1, \quad \tilde{p}\_4 \to \tilde{p}\_3, \quad \tilde{p}\_5 \to \tilde{p}\_4, \quad \tilde{p}\_8 \to \tilde{p}\_4, \\
\tilde{p}\_1 \wedge \tilde{p}\_2 \to \tilde{p}\_0, \quad \tilde{p}\_6 \wedge \tilde{p}\_7 \to \tilde{p}\_5, \quad \tilde{p}\_9 \wedge \tilde{p}\_{10} \to \tilde{p}\_8, \quad p\_1 \wedge p\_2 \wedge p\_3 \to \tilde{g}, \\
p\_1 \to \tilde{p}\_2, \quad p\_1 \to \tilde{p}\_9, \quad p\_2 \to \tilde{p}\_1, \quad p\_2 \to \tilde{p}\_7, \quad p\_3 \to \tilde{p}\_6, \quad p\_3 \to \tilde{p}\_{10}.
\end{array}$$

The trace of the computation of proveR(R_0, X, g̃) is shown in Fig. 4. Each row displays the validity tests performed by the SAT-solver and the computed answers. If the result is No(·), the last two columns show the worlds w_k in the current set W and, for each w_k, the list of the λ such that w_k ⋫_W λ; the pair selected at Step (S4) is underlined. For instance, after call (0) we have W = {w_0} and w_0 ⋫_W λ_k for every 0 ≤ k ≤ 5; the selected pair is ⟨w_0, λ_0⟩. After call (1), the set W is updated by adding the world w_1, and w_1 ⋫_W λ_3, w_1 ⋫_W λ_5 and w_0 ⋫_W λ_k for every 2 ≤ k ≤ 5 (since w_1 ∈ W, we get w_0 ⊳_W λ_0); the selected pair is ⟨w_1, λ_3⟩. Whenever the SAT-solver outputs Yes(A), we display the learned clause φ_k. The SAT-solver is invoked 15 times and there are 6 restarts. Fig. 4 also shows the derivation of R_0, X ⇒ g̃ extracted from the computation. ♦

*Example 2.* Let ψ be the second instance of problem class SYJ207 from the ILTP library [13], where η_ij = p_i ↔ p_j and γ = p_1 ∧ p_2 ∧ p_3 ∧ p_4:

$$\psi = ((\eta\_{12} \to \gamma) \land (\eta\_{23} \to \gamma) \land (\eta\_{34} \to \gamma) \land (\eta\_{41} \to \gamma)) \to (p\_0 \lor \neg p\_0 \lor \gamma)$$

³ With intuit, the set R_0 consists of the 11 clauses in the first two rows; the remaining 6 clauses are added when the →-closure of (R_0, X) is performed (see footnote 2).



[Figure content: the trace table of the SAT-solver calls and the C→-derivation of R_0, X ⇒ g̃ extracted from the computation, with cpl_1 applications on λ_1, λ_5, λ_4 and a closing cpl_0 application.]

**Fig. 4.** Computation of proveR(R_0, X, g̃), see Ex. 1.



**Fig. 5.** Computation of proveR(R_0, X, g̃), see Ex. 2.

```
1  procedure prove(R, X, g)
2    // Same Input Assumptions and Output Properties as for proveR (Fig. 3)
3    s ← newSolver(R); τ ← prAux(X, ∅, g)
4    if τ = Yes(∅) then return Valid else return CountSat
5  procedure prAux(X̃, Ã, q)
6    // Output: Yes(A) or No(M), where A ⊆ Ã and Ã ⊆ M
7    τ0 ← satProve(s, Ã, q)
8    if τ0 = Yes(A) then return Yes(A)
9    else // τ0 = No(M)
10     for λ = (a → b) → c ∈ X̃ s.t. a ∉ M and b ∉ M and c ∉ M do
11       τ1 ← prAux(X̃ \ {λ}, M ∪ {a}, b)
12       if τ1 = Yes(A) then
13         φ ← ⋀(A \ {a}) → c; addClause(s, φ)
14         return prAux(X̃, Ã, q)
15     return No(M)
16 end
```
**Fig. 6.** The prove procedure of intuit [2,10].

We proceed as in Ex. 1. The clausification procedure yields (R_0, X, g̃), where X consists of the implication clauses λ_0, …, λ_8 in Fig. 5 and the set R_0 contains the 24 flat clauses below:

$$\begin{array}{l}
p\_0 \to \tilde{g}, \quad p\_1 \to \tilde{p}\_2, \quad p\_1 \to \tilde{p}\_{13}, \quad p\_2 \to \tilde{p}\_1, \quad p\_2 \to \tilde{p}\_8, \quad p\_3 \to \tilde{p}\_7, \quad p\_3 \to \tilde{p}\_{11}, \quad p\_4 \to \tilde{p}\_{10}, \quad p\_4 \to \tilde{p}\_{14}, \\
\tilde{p}\_0 \to \tilde{p}\_5, \quad \tilde{p}\_3 \to p\_3, \quad \tilde{p}\_3 \to p\_4, \quad \tilde{p}\_4 \to p\_2, \quad \tilde{p}\_4 \to \tilde{p}\_3, \quad \tilde{p}\_5 \to p\_1, \quad \tilde{p}\_5 \to \tilde{p}\_4, \quad \tilde{p}\_6 \to \tilde{p}\_5, \quad \tilde{p}\_9 \to \tilde{p}\_5, \\
\tilde{p}\_1 \wedge \tilde{p}\_2 \to \tilde{p}\_0, \quad \tilde{p}\_7 \wedge \tilde{p}\_8 \to \tilde{p}\_6, \quad \tilde{p}\_{10} \wedge \tilde{p}\_{11} \to \tilde{p}\_9, \quad \tilde{p}\_{13} \wedge \tilde{p}\_{14} \to \tilde{p}\_{12}, \quad \tilde{p}\_{12} \to \tilde{p}\_5, \quad \gamma \to \tilde{g}.
\end{array}$$

The execution of proveR(R_0, X, g̃) (see Fig. 5) requires 14 calls to the SAT-solver and 4 restarts. After the last call we get W = {w_7, w_8, w_9} and w_k ⊳_W X for every w_k ∈ W, thus the computation ends yielding CountSat. The model K(W), depicted at the bottom left of the figure, is a countermodel for R_0, X ⇒ g̃ and for ψ (see Sec. 2). ♦

# **5 Related Work and Experimental Results**

We compare the procedure proveR of intuitR with its intuit counterpart, namely the procedure prove defined in Fig. 6. Here we comply with the presentation in [10], equivalent to the original one in [2]. The recursive auxiliary function prAux plays the role of the main loop of proveR (but in proveR the set of atoms Ã is not used); the loop inside prAux corresponds to the inner loop of proveR.⁴ We point out some major differences. Firstly, in prAux the interpretations M computed by the SAT-solver are not collected; in the loop, only the interpretation M computed at line 8 is considered, thus at the beginning of each

⁴ Actually, intuit implements a variant of prAux where as many clauses φ as possible are added to the solver.

$$\begin{array}{ll}
\dfrac{R \vdash\_c q}{R, X \Rightarrow q}\ \mathrm{cpl}\_0 &
\dfrac{R\_1, b \to c, X, A \Rightarrow b \qquad R\_1, R\_2, \varphi, X \Rightarrow q}{R\_1, R\_2, X, (a \to b) \to c \Rightarrow q}\ \mathrm{ljt} \\[2ex]
\dfrac{R\_1, X\_1 \vdash\_i \varphi \qquad R\_2, \varphi, X\_2 \Rightarrow q}{R\_1, R\_2, X\_1, X\_2 \Rightarrow q}\ \mathrm{cut} &
\begin{array}{l} A \subseteq V, \ q \in V \cup \{\bot\} \\ \varphi = \bigwedge(A \setminus \{a\}) \to c \end{array}
\end{array}$$

**Fig. 7.** The calculus LJTSAT.

iteration just the "local" conditions of the test M ⊳_W λ are checked (line 10). Secondly, the call satProve(s, w ∪ {a}, b) to the SAT-solver at Step (S5) is replaced by the recursive call prAux(X̃ \ {λ}, M ∪ {a}, b) at line 11; as a consequence, we cannot build derivations by applying rule cpl_1. As thoroughly discussed in [10], the calculus underlying intuit is the sequent calculus LJTSAT in Fig. 7, obtained from C→ by replacing the rule cpl_1 with the more general rule ljt and introducing a cut rule. Rule ljt can be seen as a generalization of Dyckhoff's implication-left rule from the calculus LJT (alias G4ip) [3,14]. We remark that a C→-derivation is isomorphic to a cut-free LJTSAT-derivation where, in every application of rule ljt, the left premise has a trivial proof (just apply rule cpl_0). In [10] it is shown how countermodels and LJTSAT-derivations can be extracted from prove computations. In brief, countermodels are obtained by considering some of the interpretations coming from No(·) answers; countermodels are in general bigger than the ones built by proveR, where at each restart the model is emptied. As an example, let σ_0 = R_0, X ⇒ g̃ be defined as in Ex. 2; the computation of prove(R_0, X, g̃) requires 31 calls to the SAT-solver (24 No(·) answers) and the computed countermodel for σ_0 has 6 worlds (see Fig. 5); instead, proveR(R_0, X, g̃) requires 14 calls and the countermodel has 3 worlds. Derivation extraction presents some awkward aspects. The key insight is that, for every recursive call prAux(X̃, Ã, q) occurring in the computation of prove(R, X, g), if prAux(X̃, Ã, q) returns Yes(A) (where A ⊆ Ã), then we can build an LJTSAT-derivation of a sequent R, R′, A, X̃ ⇒ q, where R′ contains some of the clauses added to the SAT-solver. The derivation is built either by applying the rule cpl_0, if prAux ends at line 8, or else by applying rule ljt, exploiting the derivations obtained by the recursive calls at lines 11 and 14. Accordingly, the main call prove(R, X, g) yields a derivation of R, R′, X ⇒ g. The crucial point is that the redundant clauses φ in R′ satisfy R, X ⊢_i φ (this ultimately follows from property (a) in Sec. 3), thus we can eliminate them by applying the cut rule.

*Example 3.* Let σ_0 = R_0, X ⇒ g̃ be defined as in Ex. 1; prove(R_0, X, g̃) yields the LJTSAT-derivation D_0 of R_2, φ_4, X ⇒ g̃ in Fig. 8. By applying the cut rule three times, we get an LJTSAT-derivation of σ_0. We stress that the C→-derivation of σ_0 obtained with intuitR (see Fig. 4) has a simpler structure. ♦

Finally, we remark that the clauses φ computed in prAux do not enjoy property (P2) (Sec. 4); we have observed cases where such clauses are even duplicated (e.g., with formulas from class SYJ205 of the ILTP library).


$$\begin{array}{l}
\lambda\_2 = (p\_2 \to p\_3) \to \tilde{p}\_6 \qquad \lambda\_4 = (p\_1 \to p\_3) \to \tilde{p}\_{10} \qquad \lambda\_5 = (p\_1 \to p\_2) \to \tilde{p}\_1 \\
\varphi\_0 = \tilde{p}\_6 \to \tilde{p}\_2 \qquad \varphi\_1 = \tilde{p}\_{10} \to \tilde{p}\_1 \qquad \varphi\_2 = \tilde{p}\_7 \qquad \varphi\_3 = \tilde{p}\_9 \qquad \varphi\_4 = \tilde{p}\_1 \to \tilde{p}\_{10} \qquad \varphi\_5 = \tilde{p}\_6 \\
X\_I = X \setminus \{\, \lambda\_k \mid k \in I \,\} \qquad R\_{k+1} = R\_k \cup \{\varphi\_k\}
\end{array}$$

*Experimental results* We have implemented intuitR in Haskell on top of intuit: we have replaced the function prove with proveR and added some features (e.g., trace of computations, construction of derivations/countermodels); as in intuit, we exploit the module MiniSat, a Haskell bundle of the MiniSat SAT-solver [4] (but in principle we can use any incremental SAT-solver). We compare intuitR with intuit and with two state-of-the-art provers for IPL by replicating the experiments in [2]. The first prover is fCube [5]; it is based on a standard tableaux calculus and exploits a variety of simplification rules [6] that can significantly reduce branching and backtracking. The second prover is intHistGC [11]; it relies on a sequent calculus with histories and uses dependency-directed backtracking and global caching to restrict the search space; we run it with its best flags (-b -c -c3). All tests were conducted on a machine with an Intel i7-8700 CPU at 3.20GHz and 16GB memory. We considered the benchmarks provided with the intuit implementation, including the ILTP library, the intHistGC benchmarks and the API problems introduced by the intuit developers. This amounts to a total of 1200 problems, 498 Valid and 702 CountSat; we used a 600s (seconds) timeout. Fig. 9 reports the more significant results, among which the classes where at least one prover fails and the classes where intuitR performs poorly. In all the tests, the time required by clausification is negligible. Even though no optimized data structures have been implemented, intuitR solves more problems than its competitors; in families SYJ201 (Valid formulas) and SYJ207 (CountSat formulas) intuitR outperforms its rivals; in all the other cases, except the families EC, negEC and portia, intuitR is comparable to the best prover (which is intuit in most cases). The most remarkable improvement with respect to intuit occurs with class SYJ212 (see Fig. 10), where intuit timings are fluctuating.

**Fig. 8.** Derivation D_0 of R_2, φ_4, X ⇒ g̃ in LJTSAT (see Ex. 3).

**Fig. 9.** For each prover, we report the number of solved problems within the 600s timeout and, between brackets, the total time in seconds required for the solved problems. The best prover is highlighted; a star reports that there are some unsolved problems.

To give a close comparison, let us consider the case k = 25: clausification produces 246 flat clauses and 100 implication clauses (176 atoms). Our intuit implementation requires 11214 calls to the SAT-solver (10181 No(·) answers) and the computed countermodel has 1955 worlds. Instead, intuitR requires 45 calls to the SAT-solver and 8 restarts, and yields a countermodel consisting of 4 worlds; the set W contains 26 worlds before the first restart, and one world before each of the remaining ones. With all the benchmarks, the models generated during the computation are small (typically, big models occur before the first restart); however, differently from [7,8,9], we cannot guarantee that countermodels have minimum depth or a minimum number of worlds. To complete the picture, the scatter plot in Fig. 11 compares intuitR and intuit on all the benchmarks.


$$(\cdots((\neg\neg p\_1 \leftrightarrow p\_2) \leftrightarrow p\_3) \leftrightarrow \cdots \leftrightarrow p\_k) \leftrightarrow (\cdots((p\_1 \leftrightarrow p\_2) \leftrightarrow p\_3) \leftrightarrow \cdots \leftrightarrow p\_k) \qquad \neg\alpha := \alpha \to \bot$$

**Fig. 10.** Timings for problems k = 1..50 of SYJ212 (CountSat); "–" means timeout (600s).

**Fig. 11.** Comparison between intuitR and intuit (1172 problems; the 28 problems where both provers run out of time have been omitted); the time axes are logarithmic, and the 8 red squares indicate that intuit has exceeded the timeout.

To conclude, we point out that intuitR can be extended to deal with some superintuitionistic logics [1]. For instance, let us consider the Gödel-Dummett logic GL, characterized by linear models; at any step of the computation of proveR, the model K(W) must be kept linear. Whenever the insertion of a new world into W breaks linearity, we follow a "restart with learning" strategy [12]: let γ = (a → b) ∨ (b → a) be the instance of the GL-axiom falsified at the root of K(W); we restart by taking γ as a "learned axiom", so as to avoid the repetition of the flaw. However, we cannot add γ to the SAT-solver, because γ is not a clause; instead, we add the clausification of γ, namely the clauses q̃_1 ∨ q̃_2, q̃_1 ∧ a → b, q̃_2 ∧ b → a, where q̃_1 and q̃_2 are fresh atoms. Although the language of the SAT-solver must be extended, the process converges. The other generalizations suggested in [2] (modal logics, fragments of first-order logic) seem to be more challenging.
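As a quick classical sanity check of this clausification (our own; truth tables cannot, of course, account for the intuitionistic content), the clausal form is equi-satisfiable with γ once the fresh atoms are chosen existentially:

```python
from itertools import product

def clauses(a, b, q1, q2):
    # q1 \/ q2,  q1 /\ a -> b,  q2 /\ b -> a
    return (q1 or q2) and (not (q1 and a) or b) and (not (q2 and b) or a)

def axiom(a, b):
    # (a -> b) \/ (b -> a)
    return (not a or b) or (not b or a)

for a, b in product([False, True], repeat=2):
    # the clauses classically entail the axiom ...
    assert all(axiom(a, b) for q1, q2 in product([False, True], repeat=2)
               if clauses(a, b, q1, q2))
    # ... and whenever the axiom holds, some choice of the fresh
    # atoms q1, q2 satisfies the clauses
    assert axiom(a, b) == any(clauses(a, b, q1, q2)
                              for q1, q2 in product([False, True], repeat=2))
```

The fresh atoms record which disjunct of γ is selected, which is exactly what makes the learned axiom expressible in the clausal language of the solver.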

**Acknowledgments.** I am grateful to the reviewers for their valuable suggestions. This work has been funded by the INdAM-GNCS project 2020 "Estensioni del *Property-based Testing* di e con linguaggi di programmazione dichiarativa".

# **References**



# **Proof Search and Certificates for Evidential Transactions**

Vivek Nigam<sup>1</sup> , Giselle Reis<sup>2</sup> , Samar Rahmouni<sup>2</sup> , and Harald Ruess<sup>3</sup>

<sup>1</sup> Huawei Munich Research Center, Munich, Germany (vivek.nigam@gmail.com) <sup>2</sup> Carnegie Mellon University, Ar-Rayyan, Qatar (giselle@cmu.edu, srahmoun@andrew.cmu.edu) <sup>3</sup> fortiss GmbH, Munich, Germany (ruess@fortiss.org)

**Abstract.** Attestation logics have been used for specifying systems with policies involving different principals. Cyberlogic is an attestation logic used for the specification of Evidential Transactions (ETs). In such transactions, evidence has to be provided supporting their validity with respect to given policies. For example, visa applicants may be required to demonstrate that they have sufficient funds to visit a foreign country. Such evidence can be expressed as a Cyberlogic proof, possibly combined with non-logical data (e.g., a digitally signed document). A key issue is how to construct and communicate such evidence/proofs. It turns out that attestation modalities make it challenging to apply established proof-theoretic methods such as focusing. Our first contribution is the refinement of Cyberlogic proof theory with knowledge operators which can be used to represent knowledge bases local to one or more principals. Our second contribution is the identification of an executable fragment of Cyberlogic, called Cyberlogic programs, enabling the specification of ETs. Our third contribution is a sound and complete proof system for Cyberlogic programs enabling proof search similar to search in logic programming. Our final contribution is a proof certificate format for Cyberlogic programs, inspired by Foundational Proof Certificates, as a means to communicate evidence and check its validity.

**Keywords:** Attestation Logics · Proof Search · Sequent Calculus

# **1 Introduction**

Attestation logics [1,14,21,15,6,5,29] have been used for the specification of policies of distributed systems, such as access control systems [1], distributed authorization policies [14,21], and evidential transactions (ETs) [15,5,6,29]. In these logics, one specifies policies involving attestation formulas of the form K :- F, where K is a principal (or agent) in the system.

Cyberlogic is an attestation logic for ETs. In Cyberlogic, cryptographic keys K are identified with specific authorities, and attestations K :- A express the fact that principal K attests to statement A. For example, K may be a visa-granting authority and A the statement that the visa requester is authorized

© The Author(s) 2021. A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 234–251, 2021. https://doi.org/10.1007/978-3-030-79876-5_14

to enter the specified country by the end of the year and at most once. An evidential transaction might issue a visa given that proof of sufficient funds has been provided in the form of a digital certificate whose validity can then be verified by customs authorities upon entry.

Formally, evidence in ETs can be expressed as a Cyberlogic proof. To carry out an ET, a Cyberlogic proof demonstrating policy compliance shall be produced and communicated. ETs therefore enable trust in, for example, distributed exchanges in electronic commerce, by enabling the exchange of various forms of *verifiable evidence*, such as evidence of funds in the visa example above.

The problem of producing attestation logic proofs (and proof objects) has not been given enough attention so far. Attestation logics have been formalized as Hilbert-style proof systems [1,15] that do not have the sub-formula property and therefore are not suitable for proof search. Other works on authorization logics [14,21] have proposed sequent calculi which do possess the sub-formula property. However, the search space is too great to enable efficient proof search.

The established proof-theoretic method for proof search is *focusing* [3,18]. Focusing distinguishes between inference rules that have "don't know" and "don't care" non-determinism to prune the proof search space. Interestingly, focused proof systems [7,18] provide a proof-theoretical justification for backward and forward chaining, two proof-search strategies for Horn clauses (logic programs). This justification, however, breaks when programs contain modalities, such as attestation modalities, *i.e.*, formulas of the form K :- F. This is because focusing is lost whenever any of these formulas is encountered and, therefore, the improvement of the search space due to focusing is not as significant for attestation logics.

Our main goal is the study of Cyberlogic's proof theory in order to enable proof search (similar to the search involved in logic programming) and the generation of proof certificates for the communication of evidence in ETs.

Our first contribution, detailed in Section 2, is a Gentzen style proof system for Cyberlogic that admits cut elimination. A feature of the proof system is that it enables the combination of evidence represented as logical derivations as well as digital evidence, *e.g.*, signed hashes of documents, financial statements, medical records. The logic also includes a knowledge operator for sets of principals.

Our second contribution, detailed in Section 3, is the identification of a fragment of Cyberlogic, called Cyberlogic programs, akin to Horn clauses used in logic programming. This is motivated by the ongoing work on building distributed logic programming engines for ETs which extend existing engines [10] with attestations of the form K:-A.

Our third contribution, also detailed in Section 3, addresses the challenge of how to efficiently construct Cyberlogic program proofs. We propose a focusing-inspired proof system for Cyberlogic programs and prove that it is sound and complete for this fragment. This system enables more efficient proof search.

Our last contribution, detailed in Section 4, addresses the challenge of how to efficiently communicate evidence. We propose a proof certificate format for Cyberlogic programs inspired by Foundational Proof Certificates (FPCs) [9]. FPCs enable the reconstruction of proofs by using simple logic programs as guides. This

$$\frac{}{\Gamma, A \longrightarrow A}\ init \qquad \frac{\mathsf{evidence}_{K} A}{\Gamma \longrightarrow K \mathbin{:-} A}\ ext \qquad \frac{}{\Gamma \longrightarrow \top}\ \top_r \qquad \frac{}{\Gamma, \bot \longrightarrow C}\ \bot_l$$

$$\frac{\Gamma, F_1, F_2 \longrightarrow G}{\Gamma, F_1 \land F_2 \longrightarrow G}\ \land_l \qquad \frac{\Gamma \longrightarrow F_1 \quad \Gamma \longrightarrow F_2}{\Gamma \longrightarrow F_1 \land F_2}\ \land_r \qquad \frac{\Gamma, F_1 \longrightarrow G \quad \Gamma, F_2 \longrightarrow G}{\Gamma, F_1 \lor F_2 \longrightarrow G}\ \lor_l \qquad \frac{\Gamma \longrightarrow F_i}{\Gamma \longrightarrow F_1 \lor F_2}\ \lor_{r_i}$$

$$\frac{\Gamma, F_1 \supset F_2 \longrightarrow F_1 \quad \Gamma, F_2 \longrightarrow G}{\Gamma, F_1 \supset F_2 \longrightarrow G}\ \supset_l \qquad \frac{\Gamma, F_1 \longrightarrow F_2}{\Gamma \longrightarrow F_1 \supset F_2}\ \supset_r$$

$$\frac{\Gamma, \forall x.F, F[t/x] \longrightarrow G}{\Gamma, \forall x.F \longrightarrow G}\ \forall_l \qquad \frac{\Gamma \longrightarrow F[\alpha/x]}{\Gamma \longrightarrow \forall x.F}\ \forall_r \qquad \frac{\Gamma, F[\alpha/x] \longrightarrow G}{\Gamma, \exists x.F \longrightarrow G}\ \exists_l \qquad \frac{\Gamma \longrightarrow F[t/x]}{\Gamma \longrightarrow \exists x.F}\ \exists_r$$

$$\frac{\Gamma, F \longrightarrow K \mathbin{:-} G}{\Gamma, K \mathbin{:-} F \longrightarrow K \mathbin{:-} G}\ {:-}_l \qquad \frac{\Gamma \longrightarrow F}{\Gamma \longrightarrow K \mathbin{:-} F}\ {:-}_r \qquad \frac{\Gamma, \mathsf{kb}_{\mathsf{Q}} F, F \longrightarrow G}{\Gamma, \mathsf{kb}_{\mathsf{Q}} F \longrightarrow G}\ \mathsf{kb}_l \qquad \frac{\Gamma\mid_{\mathsf{Q}} \longrightarrow F}{\Gamma \longrightarrow \mathsf{kb}_{\mathsf{Q}} F}\ \mathsf{kb}_r$$

**Fig. 1.** CL<sup>K</sup> – Cyberlogic proof system for K = {K1, ..., Kn}. Here A is an atomic formula, Q ⊆ K, and Γ|Q = {kbQ′ F | kbQ′ F ∈ Γ ∧ Q′ ⊆ Q}. Moreover, in rules ∃l and ∀r, α is a fresh constant appearing neither in Γ nor in F.

means that such certificates can elide parts that can be easily reconstructed or which one is willing to reconstruct.

## **2 Cyberlogic Proof Theory**

Cyberlogic [29] is an intuitionistic modal logic which can be used for specifying ETs. The logic is parametrized by a finite set of principals K = {K1,...,Kn}, which are used in formulas as follows:


External evidences are left unspecified since they fall outside the logical scope and depend on the ET being formalized. For example, evidenceKi A could be a signed hash of a ticket, a financial statement, a medical record, etc. In Cyberlogic the evidence associated with an ET is a combination of a formal proof (in sequent calculus) and a collection of external evidences.

Cyberlogic formulas are constructed according to the following grammar:

$$F, G ::= A \mid F \land G \mid F \lor G \mid F \supset G \mid \top \mid \bot \mid K \mathbin{:-} F \mid \mathsf{kb}_{\mathsf{Q}} F \mid \forall x. F \mid \exists x. F$$

where A is an atom, K ∈ K, and Q ⊆ K. The formula K :- F is read as "principal K attests F" and acts like the *says* modality in lax logics [13,27]. The formula kbQ F is read as "principals in Q know F" and is inspired by the *knows* modality used in linear authorization logics [14,21]. Unlike those logics, Cyberlogic allows the direct specification of knowledge shared by multiple principals, as illustrated in Example 1.
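The grammar can be mirrored by a small formula datatype. The following Python sketch (our encoding; names are illustrative, not from the paper) covers atoms, the two modalities, and implication:

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Atom:
    name: str

@dataclass(frozen=True)
class Says:                      # K :- F, "principal K attests F"
    principal: str
    body: object

@dataclass(frozen=True)
class Kb:                        # kbQ F, "principals in Q know F"
    principals: FrozenSet[str]
    body: object

@dataclass(frozen=True)
class Impl:                      # F > G
    left: object
    right: object

# K1 :- kb{K1,K2} visaOk: an attestation over shared knowledge.
f = Says("K1", Kb(frozenset({"K1", "K2"}), Atom("visaOk")))
assert f.principal == "K1" and "K2" in f.body.principals
```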

Cyberlogic sequents are of the shape Γ −→ G, where Γ is a multiset of formulas. The Cyberlogic proof system, CLK, is depicted in Figure 1. Rules for the intuitionistic connectives ∧, ∨, ⊃, ∀, ∃ are as in LJ [30]. The new rules are the ones involving assertions K:- F and kbQ. Note that a "built-in" contraction of the main formula is needed on the left premise of ⊃<sup>l</sup> and the premise of ∀l, as expected in intuitionistic logics. Also, the rule kb<sup>l</sup> has an explicit contraction on the premise. These contractions are needed for cut admissibility (Theorem 2).

Rules :<sup>l</sup> and :<sup>r</sup> specify that : is a lax modality [27,21,24]. The intuition behind :<sup>l</sup> is: if an assertion G of a principal K is provable using F, then it is also provable if K attests F. Rule :<sup>r</sup> specifies that principals are rational, *i.e.*, they can always attest formulas that are derivable. Differently from existing systems with lax modalities, CL<sup>K</sup> has the rule ext. This rule allows a proof of an attestation K:- A to be completed whenever a principal provides evidence evidenceKA for the claim A. This formalizes the intuition that principals may use digital evidence signed by their private key. We leave the definition of evidence unspecified as it depends on the intended ET specified.

Rules kbl and kbr refine Cyberlogic by enabling the collection of logical theories known by a set of principals. Such theories act as *knowledge bases*. Rule kbl specifies that any common knowledge can be part of a knowledge base. The interesting rule is kbr, which specifies that kbQ F can only be proved using the local knowledge or evidence provided by principals in Q. This is formally captured by restricting Γ in kbr's premise to the set Γ|Q = {kbQ′ F | kbQ′ F ∈ Γ ∧ Q′ ⊆ Q}. This is a powerful construct that increases the expressiveness of Cyberlogic. In particular, it is straightforward to specify that certain assertions can be concluded from the shared knowledge of a set of principals.
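As an illustration, the restriction Γ|Q used by kbr can be sketched as a filter over a list-based context (our encoding, with knowledge-base formulas modelled as tagged tuples):

```python
def restrict(gamma, q):
    """Compute Gamma|_Q: keep only the knowledge-base formulas kbQ' F
    whose principal set Q' is contained in Q; all other formulas
    (attestations, plain formulas) are dropped by the kb_r rule."""
    q = frozenset(q)
    return [f for f in gamma if f[0] == "kb" and f[1] <= q]

# The context of Example 1 below: K1 knows A and B > C, K2 knows A > B.
gamma = [("kb", frozenset({"K1"}), "A"),
         ("kb", frozenset({"K1"}), ("B", ">", "C")),
         ("kb", frozenset({"K2"}), ("A", ">", "B")),
         ("says", "K3", "D")]          # not a kb formula: dropped

assert len(restrict(gamma, {"K1", "K2"})) == 3   # all knowledge usable
assert len(restrict(gamma, {"K1"})) == 2         # only K1's knowledge
```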

**Proposition 1.** *The following sequents are provable in* CL<sup>K</sup> *for all* K ∈ K *and formulas* F1, F2*.* F<sup>1</sup> ≡ F<sup>2</sup> *represents the sequents* (F<sup>1</sup> −→ F2) *and* (F<sup>2</sup> −→ F1)*:*


*Moreover, the following sequents are not provable if* K1 ≠ K2 *and* Q1 ≠ Q2*:*

1. K :- F −→ F
2. F −→ kbQ F
3. K :- F −→ kb{K} F
4. K1 :- (K2 :- F) −→ K2 :- (K1 :- F)
5. kbQ1(kbQ2 F) −→ kbQ2(kbQ1 F)
6. kbQ1∪Q2 F −→ kbQi F, i ∈ {1, 2}
7. kbQ1∪Q2 F −→ kbQ1 F ∧ kbQ2 F
8. kbQ(K :- A) −→ K :- kbQ A
9. K :- kbQ A −→ kbQ(K :- A)

In the remainder of the paper, we elide the set of principals K whenever it can be deduced from the context.

*Example 1.* **(Shared Knowledge)** The ability to use kb with multiple principals allows the derivation of facts that depend on the combination of knowledge of multiple principals. Consider that principal K<sup>1</sup> knows A and B ⊃ C, and principal K<sup>2</sup> knows A ⊃ B, then the following sequent is provable in CL:

$$\mathsf{kb}_{\{K_1\}}A,\ \mathsf{kb}_{\{K_1\}}(B \supset C),\ \mathsf{kb}_{\{K_2\}}(A \supset B) \longrightarrow \mathsf{kb}_{\{K_1,K_2\}}C$$

*Remark 1.* The original Cyberlogic paper [5] (and technical report [4]) proposed two kinds of attestations, distinguishing whether an attestation is derived from digital evidence or from logical inferences. This combination, however, does not yield a proof system with the cut-elimination property [28].

The meta-theory of CL has been analysed using the L-framework [25], which uses rewriting logic to automatically derive structural proofs of sequent calculi properties [26]. The following lemma was used in the proofs of cut-elimination and invertibility.

**Lemma 1.** *If* Γ,K:-F −→ G*, then* Γ, F −→ G*.*

The proof proceeds by structural induction on the derivation of Γ, K :- F −→ G. The proof has been mechanically checked using the L-framework, with a few cases proved by hand.

As expected, ⊃r, ∧r, ∧l, ∨l, ∀r, ∃<sup>l</sup> are invertible whereas ∨r, ⊃l, ∀l, ∃<sup>r</sup> are not invertible. In addition, the rules :<sup>l</sup> and kb<sup>l</sup> are invertible whereas the :<sup>r</sup> and kb<sup>r</sup> are not invertible.

**Lemma 2.** *If* Γ,K:- <sup>F</sup> −→ <sup>K</sup>:- <sup>G</sup> *then* Γ, F −→ <sup>K</sup>:-G*.*

This is a simple corollary of Lemma 1. Invertibility of kbl is straightforward because of the contraction of the main formula.

Rules :r and kbr are not invertible. The counterexamples are:

$$[{:-}_r]\quad K \mathbin{:-} a \longrightarrow K \mathbin{:-} a \ \text{is provable, but}\ K \mathbin{:-} a \longrightarrow a \ \text{is not}$$

$$[\mathsf{kb}_r]\quad a,\ a \supset \mathsf{kb}_{\{K\}} b \longrightarrow \mathsf{kb}_{\{K\}} b \ \text{is provable, but}\ \cdot \longrightarrow b \ \text{is not}$$

Weakening is height-preserving admissible in CL.

**Theorem 1 (Identity expansion).** F −→ F *is provable in* CL *for any cyberlogic formula* F*.*

The proof is by structural induction on F.

**Theorem 2 (Cut elimination).** *If* Γ −→ F *and* Γ, F −→ C*, then* Γ −→ C*.*

The proof proceeds by a nested induction on the structure of the proofs of Γ −→ F and Γ, F −→ C, and the formula F. The noteworthy cases are the ones where cut needs to permute over kb rules. For kbl, contraction of the main formula is needed, and the permutation over kb<sup>r</sup> can be done only if cut is principal on the left (which is a lemma that can be proved). Details about these transformations are in Appendix A.

## **3 Cyberlogic Programs**

*Cyberlogic programs* are a fragment of CL which resembles Horn clauses in logic programming. Section 3.2 proposes a proof search operational semantics for Cyberlogic programs and proves its soundness and completeness. The proof search discipline relies on ideas from focusing [3]. Focused proof systems for LJ [18] provide a proof-theoretical justification of forward and backward chaining search. Each technique is enforced by the choice of polarity of atomic formulas: positive atoms lead to forward chaining and negative atoms lead to backward chaining. This correspondence, however, does not extend to Cyberlogic due to attestation formulas K :- A, which cause focusing to be lost [21]. Consider the following example, where the formula under focus is in brackets:

$$\frac{K_1 \mathbin{:-} a \longrightarrow [K_1 \mathbin{:-} a] \qquad K_1 \mathbin{:-} a,\ [K_2 \mathbin{:-} b] \longrightarrow K_2 \mathbin{:-} b}{K_1 \mathbin{:-} a,\ [K_1 \mathbin{:-} a \supset K_2 \mathbin{:-} b] \longrightarrow K_2 \mathbin{:-} b}\ \supset_l$$

In focused proof systems, forward chaining can be enforced by disallowing focus to be lost on the right formula in the left premise, *i.e.*, [K1 :- a]. However, if :r is applied to this sequent, the premise would be K1 :- a −→ a, which is not provable (see Proposition 1). In fact, [K1 :- a] must lose focus on the right for the proof to be completed. Therefore, if :- modalities are used in logic programs, other strategies for proof search need to be analysed.

#### **3.1 Cyberlogic Program Syntax**

Cyberlogic programs can be divided into goals, knowledge bases, common knowledge, and attestation clauses.

*Goals (*G*)* Cyberlogic programs are used to derive a goal G, defined as:

$$G ::= \top \mid K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A \mid G_1 \land G_2 \mid \exists x. G$$

where A is an atomic formula. The restriction of K :- kbQ to atoms does not reduce the expressiveness of goals, given the equivalences in Proposition 1.

*Knowledge Bases (*B*):* A knowledge base, written kb{Ki}Γ, of a principal Ki ∈ K is a set of formulas Γ not containing the connectives :- or kb. Here, kb{Ki}Γ represents the set of formulas {kb{Ki}F | F ∈ Γ}.

Intuitively, a knowledge base kb{Ki}Γ can be interpreted as Ki's local knowledge. This means that Ki may use its own prover to derive new facts. For example, if Γ is a collection of Horn clauses, then Ki may deploy a Prolog engine to derive some goal. Alternatively, if Γ is a set of formulas in CNF, then Ki may use resolution provers. The absence of modal connectives in knowledge bases has an important impact on the design of the proof certificate described in Section 4, as those may rely on existing certificates for different provers [9].
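For instance, a local engine for a propositional Horn-clause knowledge base could be as simple as the following backward-chaining sketch (ours; a real deployment would use a Prolog engine with unification):

```python
def solve(kb, goal):
    """Naive backward chaining over a principal's local Horn clauses.
    kb is a list of (head, [body atoms]); facts have an empty body.
    Propositional, cycle-free clauses are assumed."""
    return any(head == goal and all(solve(kb, b) for b in body)
               for head, body in kb)

# Hypothetical local knowledge base in the spirit of the visa example.
kb = [("sufFin", ["empContract", "valid"]),
      ("empContract", []),
      ("valid", [])]
assert solve(kb, "sufFin")
assert not solve(kb, "noCrimeRec")
```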

*Common Knowledge (*C*):* Common knowledge consists of knowledge bases that are known to all principals, written kb∅ Γ. Since ∅ ⊆ Q for every Q, these formulas remain in the context when applying kbr. In this sense they contain first-order formulas that may be used by all principals.

*Attestation Formulas (*D*):* Formulas of the form K :- kbQ A are derived by attestation formulas of the form below, where for all 1 ≤ i ≤ n, Ki ∈ K, Qi ⊆ K, A1, ..., An, A are atomic formulas, and the variables X⃗ are bound by universal quantifiers:

$$\forall \vec{X}.\Big(\mathsf{kb}_{\mathsf{Q}_1}(K_1 \mathbin{:-} A_1) \land \cdots \land \mathsf{kb}_{\mathsf{Q}_n}(K_n \mathbin{:-} A_n) \land G \supset K \mathbin{:-} (\mathsf{kb}_{\emptyset} A)\Big)$$
$$\forall \vec{X}.\Big(\mathsf{kb}_{\mathsf{Q}_1}(K_1 \mathbin{:-} A_1) \land \cdots \land \mathsf{kb}_{\mathsf{Q}_n}(K_n \mathbin{:-} A_n) \land G \supset K \mathbin{:-} (\mathsf{kb}_{\{K\}} A)\Big).$$

Intuitively, an attestation formula belongs to a principal, namely K in the right-hand side of ⊃. Such formulas derive K's attestation of an atomic formula which is its own knowledge (kb{K}A) or common knowledge (kb∅A). This means that K's attestation formulas cannot derive knowledge belonging to other principals. Furthermore, to derive an attestation, one can use the knowledge bases of other principals, *i.e.*, the formulas kbQi(Ki :- Ai), or additional goals, *i.e.*, G. Finally, notice that K :- (kb∅A) and K :- (kb{K}A) are attestation formulas themselves, where the left-hand side of ⊃ is empty (denoting ⊤).

The difference between formulas K:- A and K:-(kb{K}A) is subtle. Note that the former can be derived using the evidence rule ext, while the latter cannot. K:-(kb{K}A) is K's attestation that A follows from its local knowledge base. It is possible to specify that A can be derived from an external evidence, but this has to be made explicit by an attestation formula, *e.g.*, kb{K}(K:- A) ⊃ K:-(kb{K}A). Note that this formula is not a tautology.

We are interested in proving goals from attestation formulas, knowledge bases, and common knowledge, which are formally represented by cyberlogic program sequents defined as follows.

**Definition 1 (Cyberlogic Program Sequents (**CPS**)).** *A* cyberlogic program sequent (CPS) *is a sequent* C, B, D −→ G*, where* B *is a set of knowledge bases,* C *is a set of common knowledge formulas,* D *is a set of attestation formulas, and* G *is a goal formula.*

*Example 2.* **(Local Computations)** This example illustrates the use of kb to specify when parts of a derivation can be proved locally using a principal's knowledge. Consider that the following clause

$$\mathsf{kb}_{\{K_1\}}(K_1 \mathbin{:-} F_1) \land \mathsf{kb}_{\{K_2\}}(K_2 \mathbin{:-} F_2) \supset K \mathbin{:-} \mathsf{kb}_{\{K\}} G$$

specifies that for K to attest G, K<sup>1</sup> and K<sup>2</sup> have to attest F<sup>1</sup> and F<sup>2</sup> respectively, using *their own local theories, common knowledge, or evidence*. This means that computations carried out by K<sup>1</sup> and K<sup>2</sup> to derive their assertions K<sup>1</sup> :- F<sup>1</sup> and K<sup>2</sup> :- F<sup>2</sup> respectively, do not depend on other principals and therefore, the search for these derivations can be performed locally.

*Example 3.* (**Levels of Trust**) This example illustrates the use of kb to specify that some evidence should only be trusted if derived from trusted sources. Consider three principals K = {K<sup>T</sup> ,K<sup>U</sup> ,K} where K trusts evidence from K<sup>T</sup> , but not all evidence from K<sup>U</sup> . Then the following clause

```
kb{K,KT}(K :- critical(ok)) ∧ kbK(K :- nonCritical(ok)) ⊃ K :- kb∅(all(ok))
```
specifies that K can attest that everything is ok as a common knowledge if all the non-critical and critical elements are ok. However, the check of critical parts can only be performed by principals K trusts, namely K itself or K<sup>T</sup> . Information from K<sup>U</sup> 's knowledge bases cannot be used in the proof of critical(ok).

*Example 4.* (**Simplified Visa**) Consider a visa issuing scenario where an applicant applies to a consulate (cons) for an entry visa. This is an example of an ET as, to obtain the visa, evidence has to be provided that, for example, the applicant has no crime records, or that they have sufficient funds. We illustrate how such an ET can be specified in Cyberlogic.

The formula below labelled **main** specifies conditions for a visa to be issued:

```
main: ∀Id.∀Doc.∀V.
        ( kb{cons}(cons :- visitOk(Id, Doc))
        ∧ kb{cons}(cons :- prepVisa(Id, V))
        ∧ cons :- kb{cons}(sufFin(Doc))
        ∧ police :- kb{police}(noCrimeRec(Id))
        ⊃ cons :- kb{cons}(issVisa(Id, Doc, V)) )
```
The transaction for cons issuing a visa V to an applicant Id requires cons to attest validity of Id's visit by itself (visitOk(Id, Doc)) and Id's criminal record with the help of the police (noCrimeRec(Id)). In addition, cons also needs to attest Id's financial status (sufFin(Doc)).

The following two clauses expand on how cons can attest sufFin(Doc): either via an employment contract or a bank statement.

```
cont: kb{cons}(∀Doc.∀Cont.(empContract(Doc, Cont) ∧ valid(Cont) ⊃ sufFin(Doc)))

bankStmt: ∀Doc.∀Stmt.( kb{cons}(cons :- bankStmt(Doc, Stmt))
            ∧ bank :- kb{bank}(valid(Stmt)) ⊃ cons :- kb{cons}(sufFin(Doc)) )
```

The formula labeled **cont** belongs to cons's knowledge base. This means that cons can check the validity of an employment contract without evidence from other principals. For example, valid(Cont) may check the contract duration and salary. The formula labeled **bankStmt**, on the other hand, takes the bank statement Stmt from the given documents Doc, and requires the bank to validate it using its knowledge base. This makes sense as Id's financial records are sensitive and do not need to be disclosed to anyone apart from their financial institution.

These clauses also illustrate the subtle difference between goal formulas K :- kb{K}F and knowledge base formulas kb{K}(K :- F). For example, in the **main** clause, the fact that the applicant has come to their appointment at the consulate does not depend on other agents, and that is why we use a knowledge base formula. The same applies to the visa preparation. On the other hand, the fact that the applicant has sufficient funds may require evidence from other parties, *e.g.*, the applicant's bank. Therefore this is specified as a goal.

#### **Goal decomposition**

$$\frac{}{\Theta;\Lambda;\Delta \longrightarrow [\top]}\ \top_r \qquad \frac{\Theta;\Lambda;\Delta \longrightarrow [G_1] \quad \Theta;\Lambda;\Delta \longrightarrow [G_2]}{\Theta;\Lambda;\Delta \longrightarrow [G_1 \land G_2]}\ \land_r \qquad \frac{\Theta;\Lambda;\Delta \longrightarrow [G[t/x]]}{\Theta;\Lambda;\Delta \longrightarrow [\exists x.G]}\ \exists_r$$

$$\frac{\Theta;\Lambda;[\Delta] \longrightarrow K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A}{\Theta;\Lambda;\Delta \longrightarrow [K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A]}\ G{\Rightarrow}{:-}_l \qquad \frac{(\Theta\mid_{\mathsf{Q}})' \longrightarrow A}{\Theta;\Lambda;\Delta \longrightarrow [K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A]}\ {:-}_r{+}\mathsf{kb}_r{+}\mathsf{kb}_l$$

#### :<sup>l</sup> **application**

$$\frac{\Theta, \mathsf{kb}_{\mathsf{Q}} A;\ \Lambda;\ [\Delta] \longrightarrow K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}'} A'}{\Theta;\ \Lambda;\ [\Delta, K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A] \longrightarrow K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}'} A'}\ {:-}_l$$

$$\frac{\Theta;\ [\Lambda];\ \Delta^{\dagger} \longrightarrow K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A}{\Theta;\ \Lambda;\ [\Delta^{\dagger}] \longrightarrow K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A}\ \Rightarrow \qquad \frac{\Theta;\ \Lambda;\ \Delta^{\dagger} \longrightarrow [K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A]}{\Theta;\ \Lambda;\ [\Delta^{\dagger}] \longrightarrow K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A}\ {\Rightarrow}G$$

#### **Attestation formula decomposition**

$$\frac{\Theta;\Lambda;\Delta \longrightarrow [G\sigma] \quad \Theta;\Lambda;[\Delta, K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A\sigma] \longrightarrow K' \mathbin{:-} \mathsf{kb}_{\mathsf{Q}'} A' \quad \Theta\mid_{\mathsf{Q}_1};\cdot;\cdot \longrightarrow [K_1 \mathbin{:-} A_1\sigma] \ \cdots\ \Theta\mid_{\mathsf{Q}_n};\cdot;\cdot \longrightarrow [K_n \mathbin{:-} A_n\sigma]}{\Theta;\ [\Lambda, \forall \vec{X}.\big(\mathsf{kb}_{\mathsf{Q}_1}(K_1 \mathbin{:-} A_1) \land \cdots \land \mathsf{kb}_{\mathsf{Q}_n}(K_n \mathbin{:-} A_n) \land G \supset K \mathbin{:-} \mathsf{kb}_{\mathsf{Q}} A\big)];\ \Delta \longrightarrow K' \mathbin{:-} \mathsf{kb}_{\mathsf{Q}'} A'}\ att$$

#### K :- A **decomposition**

$$\frac{\mathsf{evidence}_{K} A}{\Theta;\cdot;\cdot \longrightarrow [K \mathbin{:-} A]}\ ext \qquad \frac{\Theta' \longrightarrow A}{\Theta;\cdot;\cdot \longrightarrow [K \mathbin{:-} A]}\ {:-}_r{+}\mathsf{kb}_l$$

#### **First-order reasoning:**

All first-order rules from CL apply to Θ′ −→ A sequents.

**Fig. 2.** CL<sup>P</sup> – Sequent calculus for cyberlogic programs. A, A′ and Ai are atoms; Δ† is such that for all K′ :- kbQ′ A′ ∈ Δ†, K′ ≠ K; and Θ′ = {F | kbQ F ∈ Θ}.

#### **3.2 CPS Proof Search**

Proof search for CPS can be divided into the following phases: goal decomposition, :l application, attestation formula decomposition, K :- A decomposition, and first-order reasoning. We define a (focusing-inspired) sequent calculus for the CPS fragment, called CL<sup>P</sup> (Figure 2), enforcing this proof search discipline. Sequents in CL<sup>P</sup> have the following shape: Θ; Λ; Δ −→ F, where Θ contains kb formulas, Λ contains attestation formulas, Δ contains formulas of the form K :- kbQ A, and F is either a goal formula, kbQ(K :- A), K :- A, or A, where A is an atom. Moreover, the part of the sequent containing the formula that is being decomposed is enclosed in square brackets. This helps distinguish the phases mentioned above.
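A minimal sketch of this phase discipline (our propositional encoding: no quantifiers, no substitution σ, and the Δ bookkeeping elided) is:

```python
def prove(theta, lam, goal):
    """Phase-structured search in the spirit of CL_P.
    theta: set of (frozenset_of_principals, atom) knowledge-base facts.
    lam:   attestation clauses (K, Q_head, atom, [subgoals]).
    goal:  ("top",), ("and", g1, g2), or ("att", K, Q, atom)."""
    if goal[0] == "top":
        return True
    if goal[0] == "and":                       # goal decomposition
        return prove(theta, lam, goal[1]) and prove(theta, lam, goal[2])
    _, k, q, atom = goal                       # goal = K :- kbQ atom
    # :r + kb_r + kb_l phase: atom follows from knowledge local to Q.
    if any(qs <= q and a == atom for (qs, a) in theta):
        return True
    # att phase: chain through one of K's own attestation clauses.
    return any(hk == k and hq <= q and ha == atom and
               all(prove(theta, lam, g) for g in body)
               for (hk, hq, ha, body) in lam)

# A tiny instance in the spirit of the visa example (names hypothetical).
theta = {(frozenset({"police"}), "noCrimeRec")}
lam = [("cons", frozenset(), "issVisa",
        [("att", "police", frozenset({"police"}), "noCrimeRec")])]
assert prove(theta, lam, ("att", "cons", frozenset(), "issVisa"))
assert not prove(theta, lam, ("att", "cons", frozenset(), "sufFin"))
```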

**Lemma 3.** *The* kbr *rule permutes down over every left rule in the* CPS *fragment.*

*Proof.* First we note that, in the CPS fragment, ∧, ∨, ∀, and kb formulas on the left do not have kb modalities as subformulas. We look at the case of kbl, as the others follow a similar argument.

Since F is not a kb formula, F ∉ (Γ, kbQ′ F, F)|Q. Therefore we can conclude that (Γ, kbQ′ F, F)|Q = (Γ, kbQ′ F)|Q and the permutation is:

$$\dfrac{\dfrac{(\Gamma, \mathsf{kb}_{\mathsf{Q}'}F, F)\mid_{\mathsf{Q}} \longrightarrow G}{\Gamma, \mathsf{kb}_{\mathsf{Q}'}F, F \longrightarrow \mathsf{kb}_{\mathsf{Q}} G}\ \mathsf{kb}_r}{\Gamma, \mathsf{kb}_{\mathsf{Q}'}F \longrightarrow \mathsf{kb}_{\mathsf{Q}} G}\ \mathsf{kb}_l \quad\rightsquigarrow\quad \dfrac{(\Gamma, \mathsf{kb}_{\mathsf{Q}'}F)\mid_{\mathsf{Q}} \longrightarrow G}{\Gamma, \mathsf{kb}_{\mathsf{Q}'}F \longrightarrow \mathsf{kb}_{\mathsf{Q}} G}\ \mathsf{kb}_r$$

The case for :l holds vacuously, as it is impossible to have :l immediately below kbr, since the former requires the right-hand formula to be of the shape K :- G.

The remaining case is ⊃<sup>l</sup>. Observe that in the CPS fragment, the formula F<sub>2</sub> in F<sub>1</sub> ⊃ F<sub>2</sub> is of the form K : kb<sub>Q</sub>A. Therefore (Γ, F<sub>2</sub>)|<sub>Q</sub> = Γ|<sub>Q</sub>, and also (Γ, F<sub>1</sub> ⊃ F<sub>2</sub>)|<sub>Q</sub> = Γ|<sub>Q</sub>. Thus the permutation is:

$$
\dfrac{\begin{array}{c}\varphi_1\\ \varGamma \longrightarrow F_1\end{array} \qquad \dfrac{\begin{array}{c}\varphi_2\\ (\varGamma,F_2)\mid_{Q} \longrightarrow G\end{array}}{\varGamma,F_2 \longrightarrow \mathsf{kb}_{Q}G}\,\mathsf{kb}_{r}}{\varGamma,F_1 \supset F_2 \longrightarrow \mathsf{kb}_{Q}G}\,\supset_{l}
\quad\rightsquigarrow\quad
\dfrac{\begin{array}{c}\varphi_2\\ (\varGamma,F_1 \supset F_2)\mid_{Q} \longrightarrow G\end{array}}{\varGamma,F_1 \supset F_2 \longrightarrow \mathsf{kb}_{Q}G}\,\mathsf{kb}_{r}
$$

Notice that it is crucial for attestation formulas to have a :-modality formula as their consequent; otherwise Lemma 3 would not hold. As seen below, this lemma is key to proving completeness of the proof search procedure for CPS.

**Theorem 3 (Soundness and completeness of CL**<sup>P</sup> **).** Θ; Λ; Δ −→ [F] *in* CL<sup>P</sup> *if and only if* Θ, Λ, Δ −→ F *in* CL*.*

*Proof.* Soundness is straightforward: a proof in CL<sup>P</sup> can be transformed into a proof in CL by using the same logical rules (possibly expanded; *e.g.*, att becomes a sequence of ∀<sup>l</sup> + ⊃<sup>l</sup> + ∧<sup>r</sup> + kb<sup>r</sup>) and skipping the phase transition rules ⇒ (which only change the syntax of the sequent, not its content).

Completeness is achieved by reasoning about invertibility and permutability of inference rules in the specific case of CPS. We argue that each phase can be performed in the proposed order.

**Goal decomposition** The goal formula can be eagerly decomposed until it becomes K : kb<sub>Q</sub>A before applying other rules, because the right rules applied (such as ∧<sup>r</sup>) are invertible and, in the absence of ∀<sup>r</sup> and ∃<sup>l</sup>, ∃<sup>r</sup> permutes down every rule. Once the formula on the right is K : kb<sub>Q</sub>A, there are two options to continue: (1) change to the :<sup>l</sup> application phase, or (2) apply the rules :<sup>r</sup> + kb<sup>r</sup> + kb<sup>l</sup> of Figure 1.

The first case is discussed below. In the second case, we need to argue that kb<sup>r</sup> may be applied immediately above :<sup>r</sup>. Once :<sup>r</sup> is applied, we could choose a formula from the context to continue with. However, kb<sup>r</sup> permutes down all left rules in the CPS fragment, as shown in Lemma 3, so any proof that continues with a formula in Θ, Λ, or Δ above :<sup>r</sup> can be transformed into one where kb<sup>r</sup> is applied immediately above :<sup>r</sup>. Since kb<sup>l</sup> is invertible, it can safely be applied to exhaustion.

:<sup>l</sup> **application** After eagerly decomposing the goal, :<sup>l</sup> can be applied to exhaustion since it is an invertible rule (Lemma 2).

**Attestation formula decomposition** This phase contains only one rule, namely att, which encompasses ∀<sup>l</sup>, ⊃<sup>l</sup>, ∧<sup>r</sup>, and kb<sup>r</sup>. The quantifier rule can always be delayed until its subformula is needed, and ∧<sup>r</sup> is invertible; therefore these rules can be chained together without loss of completeness. Due to Lemma 3, the application of kb<sup>r</sup> can be permuted down in the CPS fragment, and thus it is safe to apply the rule as soon as possible.

The two top premises of att force the proof search to go back to applying invertible rules, which does not break completeness.

K : A **decomposition** Once this phase is reached, Θ is left with kb formulas whose subformulas are in first-order logic (i.e., contain no modalities). In this case, one can either close the proof with external evidence, or apply :<sup>r</sup> + kb<sup>l</sup> to release the atom on the right-hand side. The eager application of kb<sup>l</sup> is justified by its invertibility. It can also be delayed until this point because it permutes up ⊃<sup>l</sup> and :<sup>r</sup> in CL, and it permutes up kb<sup>r</sup> in the CPS fragment (Lemma 3).

**First-order reasoning** From this point onwards there are no modalities in the sequent, so it can be proved using only first-order reasoning.

# **4 Proof Certificates**

Cyberlogic programs may be used to derive facts about attestation (goals), using pure logical reasoning (knowledge bases), principal delegation (attestation formulas), and external evidence. Once a goal is derived, evidence shall be available so that any interested party can verify that the proof is correct. Verifiable evidence means that entities do not need to trust each other's proof producing process, as long as they can check the proofs using their own trusted processes.

Given a Cyberlogic program sequent of the shape Θ; Λ; Δ −→ G, one could take its full sequent calculus proof in CL<sup>P</sup> as evidence. If the interested parties know the calculus, checking the validity of a proof reduces to checking the valid application of each rule. However, such proofs are too fine-grained and contain many uninteresting details that can easily be inferred. Proof certificates elide such details and keep only the steps crucial for proof reconstruction.

Proof certificates for Cyberlogic are inspired by λ-terms and *foundational proof certificates* (FPC) [8,20]. FPC is a framework for checking proofs in different formalisms using a small trusted kernel. The proposed kernels are the focused sequent calculus systems LKF and LJF [18] for LK and LJ, respectively, augmented with predicates for guiding proof search [9]. The definition of proof certificates for a proof system S relies on two parts: (1) a translation of S's formulas into LKF or LJF formulas; and (2) a correspondence of S proofs (or proof steps) to LKF or LJF proof steps. Given these two elements, a proof certificate for a proof of F in S consists of a predicate which guides a proof of F's translation in LKF or LJF. The following proof formats can be checked in FPC: resolution, λ-terms, Horn clauses, Frege proofs, matings, tableaux, etc.

Defining LKF or LJF FPCs for Cyberlogic is challenging due to the modalities : and kb, and to digital evidence. LKF has been used to check proofs in modal logics [19], but the translation of modal formulas into LK formulas used the modalities' semantic definition. Instead, we propose a modular CL<sup>P</sup> kernel which allows facts derived from knowledge bases or external evidence to be checked by the appropriate engine or entity.

(Figure 3 here: the CL<sup>P</sup> rules annotated with certificate terms: split(Ξ<sub>1</sub>, Ξ<sub>2</sub>) for ∧<sup>r</sup>; toSaysL(Ξ) for the phase change to :<sup>l</sup>; fol(Ψ) for closing by first-order reasoning; ext(E) for closing with external evidence; toAtt(Ξ) and toGoal(Ξ) for leaving the :<sup>l</sup> phase; and att(i, σ, [Ξ<sub>1</sub>, ..., Ξ<sub>n</sub>], Ξ′, Ξ″) for decomposing attestation formulas.)

**Fig. 3.** CL<sup>a</sup><sub>P</sub> – a CL<sup>P</sup> kernel for verifying CL<sup>P</sup> proof certificates of Cyberlogic programs. Δ<sup>†</sup> is such that for all K′ : kb<sub>Q′</sub>A′ ∈ Δ<sup>†</sup>, K′ = K, and Θ′ = {F | kb<sub>Q</sub>F ∈ Θ}.

The CL<sup>P</sup> kernel CL<sup>a</sup><sub>P</sub> (Figure 3) is constructed by augmenting sequents with a certificate Ξ (a term indicating how the proof must proceed) and with indices for the formulas in Λ. A certificate for a proof of Θ; Λ; Δ −→ G is Ξ : Θ; Λ<sup>I</sup>; Δ −→ G, where Ξ is a term built from the predicates used in CL<sup>a</sup><sub>P</sub>, and Λ<sup>I</sup> is a mapping from indices to the formulas in Λ; the indices are used in Ξ. Checking a Cyberlogic sequent Θ; Λ; Δ −→ G with certificate Ξ starts from the sequent Ξ : Θ; Λ<sup>I</sup>; Δ −→ [G]. Certificates denoted by the letter Ψ can represent proofs in other formalisms and may be checked by another engine. The predicates in Ξ serve the following purposes during a derivation in CL<sup>a</sup><sub>P</sub>.

First of all, they indicate how the proof should continue when there are multiple choices. For example, if the sequent is of the form Θ; Λ; Δ −→ [K : kb<sub>Q</sub>A], then Ξ must be either toSaysL( ) or fol( ), indicating whether to work on the : modalities on the left, or to finish the proof with first-order reasoning, respectively.

Secondly, certificates relay information at the appropriate moment. For example, split( , ) contains the certificates for each branch of a splitting rule, and ext( ) includes external evidence for the proposition A. Note that there is no certificate for ∃<sup>r</sup>, since existentials can be instantiated with meta-variables and the unification can be verified once the proof is completed.

The certificate for rule att is more interesting. It includes the index i of the attestation formula to be decomposed, the substitution σ for the ∀ quantifier, and certificates for each premise. Note that each Ξ<sub>1</sub>, ..., Ξ<sub>n</sub> must be ext( ) or fol( ).

*Example 5.* Consider Example 4, and let the indices of the formulas be their labels: **main**, **cont**, and **bankStmt**. The certificate for a proof that alice can get a visa is Ξ : **cont**; **main**, **bankStmt**; · −→ cons : kb<sub>{cons}</sub>issVisa(alice, doc, visa), where Ξ is:

att(**main**, {Id → alice, Doc → doc, V → visa}, [fol(ΨvisitOk), fol(ΨprepVisa)], ΞG, Ξ0)

The certificates ΨvisitOk and ΨprepVisa are first-order logic proof certificates from derivations using the consulate's own knowledge base.

Certificate Ξ<sup>0</sup> corresponds to att's premise where the conclusion of **main** is added to the context. This branch can be closed by removing the modalities, so Ξ<sup>0</sup> = toGoal(fol(id)), where id is a first-order logic directive to close the proof.

Certificate Ξ<sup>G</sup> guides the proof of the new goal:

cons : kb<sub>{cons}</sub>(sufFin(doc)) ∧ police : kb<sub>{police}</sub>(noCrimeRec(alice))

and thus Ξ<sup>G</sup> = split(Ξfin, Ξcrime). Ξfin depends on how cons decides to check for sufficient funds. It could rely on the bank and use the attestation formula **bankStmt**, in which case Ξfin has the shape

$$\mathsf{toSaysL}\left(\mathsf{toAtt}(\mathsf{att}(\mathbf{bankStmt}, \ldots, \ldots))\right)$$

Or it could use **cont** from its knowledge base, in which case Ξfin would be fol( ).
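To fix intuitions, the certificate terms of this section can be modeled as a small algebraic datatype. The following Python sketch is our own illustration: class and field names are invented, and only the constructor names follow Figure 3; the final value mirrors the Example 5 certificate in simplified form.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Certificate constructors, following the names used in Figure 3.
Cert = Union["Fol", "Ext", "Split", "ToSaysL", "ToAtt", "ToGoal", "Att"]

@dataclass
class Fol:                 # fol(Ψ): delegate to a first-order certificate
    psi: str

@dataclass
class Ext:                 # ext(E): external evidence for an atom
    evidence: str

@dataclass
class Split:               # split(Ξ1, Ξ2): one certificate per ∧r branch
    left: "Cert"
    right: "Cert"

@dataclass
class ToSaysL:             # phase change to :l application
    cont: "Cert"

@dataclass
class ToAtt:               # leave the :l phase towards attestation decomposition
    cont: "Cert"

@dataclass
class ToGoal:              # leave the :l phase back to goal decomposition
    cont: "Cert"

@dataclass
class Att:                 # att(i, σ, [Ξ1..Ξn], Ξ', Ξ'')
    index: str             # index of the attestation formula, e.g. "main"
    subst: Dict[str, str]  # substitution σ for the ∀ quantifier
    premises: List["Cert"] # each must be a Fol or an Ext
    goal_cert: "Cert"      # Ξ' for the goal G
    cont_cert: "Cert"      # Ξ'' for the extended context

# The certificate of Example 5 (simplified):
xi = Att("main", {"Id": "alice", "Doc": "doc", "V": "visa"},
         [Fol("visitOk"), Fol("prepVisa")],
         Split(Fol("fin"), Fol("crime")),   # Ξ_G
         ToGoal(Fol("id")))                 # Ξ_0
```

A checker walking such a term would dispatch on the constructor at each sequent, exactly as the predicates guide the derivation in CL<sup>a</sup><sub>P</sub>.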

## **5 Related Work**

Attestation logics have been proposed for specifying the policies of several distributed systems [14,21,15,5,29,1], and some of this work inspired the design of Cyberlogic. Cyberlogic itself was proposed some decades ago [29,5], but until now its proof theory had not been carefully investigated; in particular, there were no cut-elimination results. We were also inspired by previous work on authorization logics [14,21,15] to extend Cyberlogic with knowledge operators.

The main contribution of our work is the study of proof search and proof certificates for attestation logics with knowledge operators.

In previous work on intuitionistic authorization logic [14], knowledge was restricted to a single principal. As demonstrated in Example 1, allowing knowledge bases for multiple principals enables collaborative reasoning.

Proof search for attestation logics is not adequately addressed in the literature: either the proposed proof systems are Hilbert-style [1,2,17], which do not enjoy the subformula property and are therefore unsuitable for proof search, or they are sequent calculus systems that are not focused [14,21,29,5,16]. [14] only speculates that logic programming languages can be used to carry out proof search for fragments of attestation logic; we confirm this speculation with the definition of Cyberlogic programs.

Our main inspiration for proof certificates is the work on foundational proof certificates [9]; however, that work did not consider proof certificates for attestation logics. Closer to our objective is the work of Libal and Volpe [19], who define proof certificates for modal logics by encoding (the semantics of) these logics in LKF. Our work instead proposes proof certificates directly in Cyberlogic, which means we can capitalize on rules, such as attestation rules, to build more compact certificates. Another difference is that our proof certificates may contain (pointers to) extra-logical evidence.

Cyberlogic has been formalized in Coq [11], encoding evidential transactions for Schengen visa applications. Our approach differs in that it lays a proof-theoretic foundation for Cyberlogic; in particular, proof search is formally justified, as is the representation of Cyberlogic proofs as FPCs.

Logic programming engines, such as ETB [10], have been proposed for programming ETs. However, these engines do not (yet) support attestations, such as K : F, local knowledge, such as kb<sub>Q</sub>F, or the use of digital certificates. We believe this line of work can greatly profit from the foundations laid by this paper.

Finally, the works [15,6] propose the use of evidence for authorization. Specifically, [16] shows that a fragment of their system is decidable in linear time. It would be interesting to investigate how this fragment relates to Cyberlogic programs, and whether the proof certificates defined in this work apply to the decidable fragment. This is left for future work.

#### **6 Conclusions**

This paper lays the proof-theoretic foundations of Cyberlogic, an attestation logic for evidential transactions, and refines Cyberlogic with epistemic modalities. We identify a fragment of Cyberlogic, Cyberlogic programs, and propose a proof system, similar to focused proof systems, that enables sound and complete proof search. The permutations necessary for completeness rely on the careful interplay between the attestation modalities, K :, and the knowledge modalities, kb<sub>Q</sub>. We then propose a concise proof certificate format for proofs of Cyberlogic programs.

This paper is the first step toward a framework enabling evidential transactions that we are currently implementing. In particular, we are extending the distributed Datalog engines available in [10] to support Cyberlogic. Moreover, we are integrating such engines with public key infrastructure, available in, for example, distributed ledger technologies. This means that evidence, both in the form of digital evidence and of logical derivations represented as FPCs, can be stored and audited through the ledger.

We are currently investigating extensions of Cyberlogic programs with other modalities, such as temporal and epistemic ones [23,12], while still preserving its good proof search properties. We have also started to study conditions under which two attestation rules can be introduced in any order. If two clauses can be introduced in any order, then they can also be introduced in parallel; this would provide a proof-theoretic justification for proof search optimizations. It could be used, for example, to refine the dependency graphs used for evaluating distributed logic programs [22] so that they take principals into account. These results will impact the maintenance of evidential transactions, whose applications can have important consequences for, *e.g.*, certification in the automotive and avionics domains.

*Acknowledgment:* We would like to thank Dian Balta, Natarajan Shankar, and Tewodros Beyene for useful discussions and valuable feedback on earlier versions of this paper. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 830892 and from BayernCloud 3, AZ: 20-13-3410.I-01A-2017. Nigam is partially supported by CNPq grant 303909/2018-8.

## **A Cut-elimination**

*Proof.* (Sketch) The proof follows the usual Gentzen strategy of reducing the cut's grade and rank. The interesting cases are rank reductions over the kb rules.

In the case of kb<sup>l</sup>, contraction of the main formula is needed for the permutation to work: without it, we could not conclude Γ, A −→ G from Γ, kb<sub>Q</sub>A −→ G. The transformations are:

$$
\dfrac{\dfrac{\begin{array}{c}\varphi_1\\ \varGamma,\mathsf{kb}_{Q}A, A \longrightarrow C\end{array}}{\varGamma,\mathsf{kb}_{Q}A \longrightarrow C}\,\mathsf{kb}_l \qquad \begin{array}{c}\varphi_2\\ \varGamma,\mathsf{kb}_{Q}A, C \longrightarrow G\end{array}}{\varGamma,\mathsf{kb}_{Q}A \longrightarrow G}\,\mathsf{cut}
\;\rightsquigarrow\;
\dfrac{\dfrac{\begin{array}{c}\varphi_1\\ \varGamma,\mathsf{kb}_{Q}A, A \longrightarrow C\end{array} \qquad \begin{array}{c}\varphi_2 + \text{weakening}\\ \varGamma,\mathsf{kb}_{Q}A, A, C \longrightarrow G\end{array}}{\varGamma,\mathsf{kb}_{Q}A, A \longrightarrow G}\,\mathsf{cut}}{\varGamma,\mathsf{kb}_{Q}A \longrightarrow G}\,\mathsf{kb}_l
$$

$$
\dfrac{\begin{array}{c}\varphi_1\\ \varGamma,\mathsf{kb}_{Q}A \longrightarrow C\end{array} \qquad \dfrac{\begin{array}{c}\varphi_2\\ \varGamma,\mathsf{kb}_{Q}A, A, C \longrightarrow G\end{array}}{\varGamma,\mathsf{kb}_{Q}A, C \longrightarrow G}\,\mathsf{kb}_l}{\varGamma,\mathsf{kb}_{Q}A \longrightarrow G}\,\mathsf{cut}
\;\rightsquigarrow\;
\dfrac{\dfrac{\begin{array}{c}\varphi_1 + \text{weakening}\\ \varGamma,\mathsf{kb}_{Q}A, A \longrightarrow C\end{array} \qquad \begin{array}{c}\varphi_2\\ \varGamma,\mathsf{kb}_{Q}A, A, C \longrightarrow G\end{array}}{\varGamma,\mathsf{kb}_{Q}A, A \longrightarrow G}\,\mathsf{cut}}{\varGamma,\mathsf{kb}_{Q}A \longrightarrow G}\,\mathsf{kb}_l
$$

The other interesting case is when we need to permute a cut over a kb<sup>r</sup> rule on the right branch:

$$
\dfrac{\begin{array}{c}\varphi_1\\ \varGamma \longrightarrow C\end{array} \qquad \dfrac{(\varGamma,C)\mid_{Q_i} \longrightarrow G}{\varGamma,C \longrightarrow \mathsf{kb}_{Q_i}G}\,\mathsf{kb}_r}{\varGamma \longrightarrow \mathsf{kb}_{Q_i}G}\,\mathsf{cut}
$$

There are two cases to consider:

1. C ≡ kb<sub>Q<sub>j</sub></sub>C′ and Q<sub>i</sub> ⊆ Q<sub>j</sub>: in this case, we can permute the cut over the rules in ϕ<sub>1</sub> (left rules, except :<sup>l</sup>, which is never applicable) until C is principal; this can be proved by case analysis. At that point, the premise on the left branch will be Γ|<sub>Q<sub>j</sub></sub> −→ C′. Then kb<sup>r</sup> can be applied to the end-sequent, resulting in:

$$
\dfrac{\dfrac{\begin{array}{c}\varphi_1'\\ \varGamma\mid_{Q_i} \longrightarrow \mathsf{kb}_{Q_j}C'\end{array} \qquad \begin{array}{c}\varphi_2'\\ \varGamma\mid_{Q_i}, \mathsf{kb}_{Q_j}C' \longrightarrow G\end{array}}{\varGamma\mid_{Q_i} \longrightarrow G}\,\mathsf{cut}}{\varGamma \longrightarrow \mathsf{kb}_{Q_i}G}\,\mathsf{kb}_r
$$

The proof ϕ′<sub>2</sub> is exactly ϕ<sub>2</sub>, since (Γ, kb<sub>Q<sub>j</sub></sub>C′)|<sub>Q<sub>i</sub></sub> ≡ Γ|<sub>Q<sub>i</sub></sub>, kb<sub>Q<sub>j</sub></sub>C′ when Q<sub>i</sub> ⊆ Q<sub>j</sub>. The proof ϕ′<sub>1</sub> is obtained from the proof of Γ|<sub>Q<sub>j</sub></sub> −→ C′, since Γ|<sub>Q<sub>j</sub></sub> ⊆ Γ|<sub>Q<sub>i</sub></sub> when Q<sub>i</sub> ⊆ Q<sub>j</sub>.

2. C ≢ kb<sub>Q<sub>j</sub></sub>C′ or Q<sub>i</sub> ⊄ Q<sub>j</sub>: in this case C ∉ (Γ, C)|<sub>Q<sub>i</sub></sub>, so kb<sup>r</sup> can be applied directly to the end-sequent, and the cut can be removed.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### Non-clausal Redundancy Properties<sup>*</sup>

Lee A. Barnett and Armin Biere

Johannes Kepler University Linz Altenbergerstraße 69, 4040 Linz, Austria {lee.barnett,armin.biere}@jku.at

Abstract. State-of-the-art refutation systems for SAT are largely based on the derivation of clauses meeting some redundancy criteria, ensuring their addition to a formula does not alter its satisfiability. However, there are strong propositional reasoning techniques whose inferences are not easily expressed in such systems. This paper extends the redundancy framework beyond clauses to characterize redundancy for Boolean constraints in general. We show this characterization can be instantiated to develop efficiently checkable refutation systems using redundancy properties for Binary Decision Diagrams (BDDs). Using a form of reverse unit propagation over conjunctions of BDDs, these systems capture, for instance, Gaussian elimination reasoning over XOR constraints encoded in a formula, without the need for clausal translations or extension variables. Notably, these systems generalize those based on the strong Propagation Redundancy (PR) property, without an increase in complexity.

#### 1 Introduction

The correctness and reliability of Boolean satisfiability (SAT) solvers is critical for many applications. For instance, SAT solvers are used to verify hardware and software systems (e.g., [19,28,44]), to search for solutions to open problems in mathematics (e.g., [38,46]), and as subroutines of other logical reasoning tools (e.g., [7,67]). Solvers should be able to provide solution certificates that are easily and externally checkable. For a satisfiable formula, any satisfying assignment is a suitable certificate and typically can be easily produced by a solver. For an unsatisfiable formula, a solver should be able to produce a refutation proof.

Modern SAT solvers primarily refute unsatisfiable formulas using clausal proof systems, such as the popular DRAT system [69] used by the annual SAT competition in recent years [4], or newer systems based on the surprisingly strong Propagation Redundancy (PR) property [33]. Clausal proof systems iteratively extend a formula, typically given in conjunctive normal form (CNF), by adding clauses that are redundant; that is, their addition to the formula does not affect whether it is satisfiable. Systems are distinguished by their underlying redundancy properties, restricted but efficiently-decidable forms of redundancy.

<sup>*</sup> Supported by the Linz Institute of Technology AI Lab funded by the State of Upper Austria, as well as by the Austrian Science Fund (FWF) under project W1255-N23, the LogiCS Doctoral College on Logical Methods in Computer Science.

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 252–272, 2021. https://doi.org/10.1007/978-3-030-79876-5_15

Redundancy is a useful notion in SAT as it captures most inferences made by state-of-the-art solvers. This includes clauses implied by the current formula, such as the resolvent of two clauses or clauses learned during conflict-driven clause learning (CDCL) [8,51], as well as clauses which are not implied but derived nonetheless by certain preprocessing and inprocessing techniques [43], such as those based on blocked clauses [42,45,48]. Further, clausal proof systems based on properties like PR include short refutations for several hard families of formulas, such as those encoding the pigeonhole principle, that have no polynomial-length refutations in resolution [2] (see [16] for an overview). These redundancy properties, seen as inference systems, thus potentially offer significant improvements in efficiency, as the CDCL algorithm at the core of most solvers searches only for refutations in resolution [9]. While the recent satisfaction-driven clause learning (SDCL) paradigm has shown some initial success [35,37], it is still unclear how to design solving techniques which take full advantage of this potential.

Conversely, there are existing strong reasoning techniques which likewise exceed the abilities of CDCL alone but are difficult to express using clausal proof systems. Important examples include procedures for reasoning over CNF formulas encoding pseudo-Boolean and cardinality constraints (see [58]), as well as Gaussian elimination (see [12,61,62,68]), which has been highlighted as a challenge for clausal proof systems [31]. Gaussian elimination, applied to sets of "exclusive-or" (XOR) constraints, is a crucial technique for many problems from cryptographic applications [62], and can efficiently solve, for example, Tseitin formulas that are hard for resolution [64,66]. This procedure, implemented by CryptoMiniSAT [62], Lingeling [10], and Coprocessor [50], for example, can be polynomially simulated by extended resolution, which allows inferences over new variables, and by similar systems (see [56,60]). However, due to the difficulty of such simulations, they are not typically implemented. Instead, solvers supporting these techniques simply prevent them from running when proof output is required, preferring less efficient techniques whose inferences can be more easily represented.
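To make the technique at issue concrete, here is a minimal, hypothetical sketch of Gaussian elimination over XOR constraints; the representation is ours (each constraint x1 ⊕ ... ⊕ xk = b is a pair of a variable set and a parity bit, and rows are added over GF(2)):

```python
# Forward Gaussian elimination over GF(2): detect whether a set of XOR
# constraints is consistent. Adding two rows is symmetric difference of
# their variable sets plus XOR of their parity bits.
def gaussian_elim(constraints):
    """Return False if the XOR system is inconsistent, else True."""
    solved = []                            # list of (pivot_var, (vars, parity))
    for vs, b in [(set(vs), b) for vs, b in constraints]:
        for pivot, (pvs, pb) in solved:    # eliminate existing pivots
            if pivot in vs:
                vs, b = vs ^ pvs, b ^ pb   # row addition over GF(2)
        if not vs:
            if b == 1:
                return False               # derived 0 = 1: inconsistent
            continue                       # redundant row 0 = 0
        solved.append((min(vs), (vs, b)))  # new pivot row
    return True

# x1⊕x2 = 1, x2⊕x3 = 1, x1⊕x3 = 1 is inconsistent (summing all rows gives 0 = 1):
assert gaussian_elim([({1, 2}, 1), ({2, 3}, 1), ({1, 3}, 1)]) is False
```

Expressing the three elimination steps above clausally requires eight clauses per XOR and blows up quickly, which is exactly the difficulty the paper addresses.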

This paper extends the redundancy framework for clausal proof systems to include non-clausal constraints, such as XOR or cardinality constraints, presenting a characterization of redundancy for Boolean functions in general. We demonstrate a particular use of this characterization by instantiating it for functions represented by Binary Decision Diagrams [13], a powerful representation with a long history in SAT solving (e.g. [14,23,24,52,54]) and other areas of automated reasoning (e.g. [15,29,47,57]). We show the resulting refutation systems succinctly express Gaussian elimination while also generalizing existing clausal systems. Results using a prototype implementation confirm these systems allow compact and efficiently checkable refutations of CNF formulas that include embedded XOR constraints solvable by Gaussian elimination.
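For readers unfamiliar with the data structure, the following minimal reduced ordered BDD (ROBDD) sketch is our own illustration, using the integer order on variables; it shows how Boolean functions and their conjunctions can be represented with shared, hash-consed nodes:

```python
# Minimal ROBDD: terminals are the constants 0 (FALSE) and 1 (TRUE); every
# other node id maps to a triple (var, low, high).
class BDD:
    FALSE, TRUE = 0, 1

    def __init__(self):
        self._uniq = {}      # (var, lo, hi) -> node id  (hash-consing)
        self._node = {}      # node id -> (var, lo, hi)
        self._next = 2

    def mk(self, var, lo, hi):
        if lo == hi:                   # redundant test: skip the node
            return lo
        key = (var, lo, hi)
        if key not in self._uniq:      # share structurally equal nodes
            self._uniq[key] = self._next
            self._node[self._next] = key
            self._next += 1
        return self._uniq[key]

    def var(self, v):
        """BDD for the single variable v."""
        return self.mk(v, self.FALSE, self.TRUE)

    def apply_and(self, u, v, memo=None):
        """Conjunction of two BDDs via Shannon expansion on the top variable."""
        memo = {} if memo is None else memo
        if u == self.FALSE or v == self.FALSE:
            return self.FALSE
        if u == self.TRUE:
            return v
        if v == self.TRUE:
            return u
        if (u, v) in memo:
            return memo[(u, v)]
        xu, lu, hu = self._node[u]
        xv, lv, hv = self._node[v]
        x = min(xu, xv)
        ul, uh = (lu, hu) if xu == x else (u, u)
        vl, vh = (lv, hv) if xv == x else (v, v)
        r = self.mk(x, self.apply_and(ul, vl, memo), self.apply_and(uh, vh, memo))
        memo[(u, v)] = r
        return r

b = BDD()
x1, x2 = b.var(1), b.var(2)
conj = b.apply_and(x1, x2)            # BDD for x1 ∧ x2
assert b.apply_and(conj, x1) == conj  # absorption, detected by node sharing
```

Because reduced, ordered BDDs are canonical for a fixed variable order, equality of functions is equality of node ids, which is what makes redundancy checks over conjunctions of BDDs efficiently mechanizable.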

In the rest of the paper, Section 2 includes preliminaries and Section 3 presents the characterization of redundancy for Boolean functions. Section 4 introduces redundancy properties for BDDs, and Section 5 demonstrates their use for Gaussian elimination. Section 6 presents the results of our preliminary implementation, and Section 7 concludes.

#### 2 Preliminaries

We assume a set of Boolean variables V under a fixed order ≺ and use standard SAT terminology. The set of truth values is B = {0, 1}. An *assignment* is a function τ : V → B and the set of assignments is B<sup>V</sup> . A function f : B<sup>V</sup> → B is *Boolean*. If f(τ )=1 for some τ ∈ B<sup>V</sup> then f is *satisfiable*, otherwise f is *unsatisfiable*. Formulas express Boolean functions as usual, are assumed to be in conjunctive normal form, and are written using capital letters F and G. A clause can be represented by its set of literals and a formula by its set of clauses.

A *partial assignment* is a non-contradictory set of literals σ; that is, if l ∈ σ then ¬l ∉ σ. The *application* of a partial assignment σ to a clause C is written C|<sub>σ</sub> and defined by: C|<sub>σ</sub> = ⊤ if every τ ∈ B<sup>V</sup> that satisfies ⋀<sub>l∈σ</sub> l also satisfies C; otherwise C|<sub>σ</sub> = {l | l ∈ C and l, ¬l ∉ σ}. For example, (x<sub>1</sub> ∨ x<sub>2</sub>)|<sub>{¬x1,x2}</sub> = ⊤, and (x<sub>1</sub> ∨ x<sub>2</sub>)|<sub>{¬x2,¬x3}</sub> = (x<sub>1</sub>). Similarly, the application of σ to a formula F is written F|<sub>σ</sub> and defined by: F|<sub>σ</sub> = ⊤ if C|<sub>σ</sub> = ⊤ for all C ∈ F; otherwise F|<sub>σ</sub> = {C|<sub>σ</sub> | C ∈ F and C|<sub>σ</sub> ≠ ⊤}. *Unit propagation* is the iterated replacement of F with F|<sub>{l}</sub> for each unit clause (l) ∈ F, until F includes the empty clause ⊥, or F contains no unit clauses. A formula F implies a clause C by *reverse unit propagation* (RUP) if unit propagation on F ∧ ¬C ends by producing ⊥ [27].
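These definitions translate directly into code. The following is an illustrative Python sketch (the representation is ours, not from the paper): clauses are frozensets of integer literals, negation is unary minus, and a partial assignment is a set of literals.

```python
# TOP stands for the satisfied clause/formula ⊤.
TOP = True

def apply_clause(clause, sigma):
    """C|σ: ⊤ if σ satisfies C, else C minus the literals falsified by σ."""
    if any(l in sigma for l in clause):
        return TOP
    return frozenset(l for l in clause if -l not in sigma)

def apply_formula(formula, sigma):
    """F|σ: drop satisfied clauses and shrink the remaining ones."""
    reduced = (apply_clause(c, sigma) for c in formula)
    return [c for c in reduced if c is not TOP]

def unit_propagate(formula):
    """Iterate F := F|{l} for unit clauses (l); True iff ⊥ is produced."""
    f = list(formula)
    while not any(len(c) == 0 for c in f):   # ⊥ is the empty clause
        units = [next(iter(c)) for c in f if len(c) == 1]
        if not units:
            return False                     # no unit clauses left
        f = apply_formula(f, {units[0]})
    return True

def rup(formula, clause):
    """F implies C by RUP: propagation on F ∧ ¬C produces ⊥."""
    return unit_propagate(formula + [frozenset([-l]) for l in clause])

# (x1 ∨ x2) ∧ (¬x1 ∨ x2) implies (x2) by RUP:
assert rup([frozenset({1, 2}), frozenset({-1, 2})], frozenset({2}))
```

Real checkers use watched-literal schemes rather than rebuilding the formula, but the semantics is the one above.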

For a formula F and clause C, if F and F ∧ C are equisatisfiable (both satisfiable or both unsatisfiable) then C is *redundant* with respect to F. Efficiently identifiable redundant clauses are at the foundation of many formula simplification techniques and refutation systems (for instance, see [32,33,37,43]). In general, deciding whether a clause is redundant is complete for the complement of the class DP [6], containing both NP and co-NP [55], so solvers and proof systems rely on polynomially-decidable *redundancy properties* for checking specific instances of redundancy. The following characterization of redundant clauses provides a common framework for formulating such properties.

Theorem 1 (Heule, Kiesl, and Biere [36]). *A clause* C ≠ ⊥ *is redundant with respect to a formula* F *if and only if there is a partial assignment* ω *such that* C|<sub>ω</sub> = ⊤ *and* F|<sub>α</sub> ⊨ F|<sub>ω</sub>*, for the partial assignment* α = {¬l | l ∈ C}*.*

The partial assignment ω, usually called a *witness* for C, includes at least one of the literals occurring in C, while α is said to *block* the clause C. Redundancy properties can be defined by replacing ⊨ in the theorem above with efficiently-decidable relations R such that R ⊆ ⊨. *Propagation redundancy* (PR) [33] replaces ⊨ with ⊢<sub>1</sub>, where F ⊢<sub>1</sub> G if and only if F implies each D ∈ G by RUP. The property PR gives rise to a refutation system, in which a refutation is a list of clauses C<sub>1</sub>, ..., C<sub>n</sub> and witnesses ω<sub>1</sub>, ..., ω<sub>n</sub> such that C<sub>k</sub>|<sub>ωk</sub> = ⊤ and (F ∧ ⋀<sub>i=1</sub><sup>k−1</sup> C<sub>i</sub>)|<sub>αk</sub> ⊢<sub>1</sub> (F ∧ ⋀<sub>i=1</sub><sup>k−1</sup> C<sub>i</sub>)|<sub>ωk</sub> for all 1 ≤ k ≤ n, and F ∧ ⋀<sub>i=1</sub><sup>n</sup> C<sub>i</sub> ⊢<sub>1</sub> ⊥.
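The PR check itself can be sketched directly from Theorem 1, with ⊨ replaced by ⊢₁. The following hypothetical Python fragment uses invented names and a naive propagation loop; production checkers are far more efficient:

```python
# Clause C with witness ω is PR w.r.t. F if C|ω = ⊤ and F|α ⊢₁ F|ω,
# where α = {¬l | l ∈ C}. Clauses are frozensets of integer literals.

def restrict(formula, sigma):
    """F|σ: drop satisfied clauses, remove falsified literals from the rest."""
    return [frozenset(l for l in c if -l not in sigma)
            for c in formula if not any(l in sigma for l in c)]

def entails_by_rup(f, g):
    """F ⊢₁ G: every clause D ∈ G follows from F by reverse unit propagation."""
    def rup(f, d):
        f = f + [frozenset([-l]) for l in d]   # assume ¬D, then propagate
        while not any(len(c) == 0 for c in f):
            units = [next(iter(c)) for c in f if len(c) == 1]
            if not units:
                return False                   # propagation got stuck
            f = restrict(f, {units[0]})
        return True                            # derived the empty clause ⊥
    return all(rup(f, d) for d in g)

def is_pr(formula, clause, omega):
    alpha = {-l for l in clause}
    if not any(l in omega for l in clause):    # C|ω must equal ⊤
        return False
    return entails_by_rup(restrict(formula, alpha), restrict(formula, omega))

# C = (x1) with witness ω = {x1} is PR w.r.t. (x1 ∨ x2) ∧ (¬x2 ∨ x1), since
# F|α = (x2) ∧ (¬x2) propagates to ⊥ and hence ⊢₁ anything:
assert is_pr([frozenset({1, 2}), frozenset({-2, 1})], frozenset({1}), {1})
```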

Most redundancy properties used in SAT solving can be understood as restricted forms of propagation redundancy. The RAT property [43] is equivalent to *literal propagation redundancy*, where the witness ω for any clause C may differ from the associated α on only one literal; that is, ω = (α \ {¬l}) ∪ {l} for some l ∈ C [36]. The DRAT system [69] is based on RAT, with the added ability to remove clauses from the accumulated formula F ∧ ⋀ C<sub>i</sub>.

Fig. 1: Different notions of redundancy and their relationships. An arrow from A to B indicates that A generalizes B. Properties to the right of the thick dashed line are polynomially checkable; those to the right of the thin dotted line only derive logical consequences. Novel properties defined in this paper are shown in grey.

#### 3 Redundancy for Boolean Functions

Theorem 1 provides a foundation for clausal proof systems by characterizing redundant clauses in a convenient way. However, the restriction to clauses places limitations on these systems, making some forms of non-clausal reasoning difficult to express. For solvers aiming to construct refutations in these systems, this translates directly to restrictions on which solving techniques can be used.

We show this characterization can be broadened to include redundancy for non-clausal constraints, and can be used to define useful redundancy properties and refutation systems. The contributions of this paper are divided into three corresponding levels of generality. The top level, covered in the current section, is the direct extension of Theorem 1 from redundancy for clauses, written R, to redundancy for Boolean functions, written R_f. The middle level, the focus of Section 4, instantiates the resulting Theorem 2 to define the refutation systems RUPBDD and PRBDD based on redundancy for Binary Decision Diagrams. At the bottom level, these systems are shown to easily handle Gaussian elimination (GE) in Section 5, as well as some aspects of cardinality reasoning (CR). The relationships between these notions of redundancy are shown in Figure 1.

Each level of generality is individually important to this work. At the bottom level, the straightforward expression of Gaussian elimination by RUPBDD and PRBDD makes it more feasible for solvers to use this efficient technique with proof production, especially as these systems generalize their clausal analogs already in use. The results in Section 6 confirm the usefulness of RUPBDD for this purpose. At the middle level, we show the notion of redundancy instantiated for BDDs in this way may be capable of other strong forms of reasoning as well. Finally, the top level provides a very general form of redundancy, independent of function representation. This may make possible the design of redundancy properties and refutation systems in contexts where the BDD representation of constraints is too large; for example, it is known that some pseudo-Boolean constraints can in general have exponential-size BDD representations [1,41].

This section presents in Theorem 2 a characterization of redundancy for Boolean functions in general. One way of instantiating this characterization is demonstrated in Section 4 where the functions are represented by Binary Decision Diagrams; the resulting refutation systems are shown in Section 5 to easily express Gaussian elimination. However, the applicability of Theorem 2 is much broader, providing a foundation for redundancy-based refutation systems independent of the representation used.

Proofs of theoretical results not included in the text can be found in an extended version of this paper [5]. We begin with the property R_f.

Definition 1. *A Boolean function* g *is* redundant *with respect to a Boolean function* f *if the functions* f *and* f ∧ g *are both satisfiable, or both unsatisfiable.*

As we will see, extending Theorem 1 to the non-clausal case relies on the notion of a *Boolean transformation*, or just transformation: a function ϕ : B<sup>V</sup> → B<sup>V</sup>, mapping assignments to assignments. Importantly, for a function f and transformation ϕ, the composition f ◦ ϕ : B<sup>V</sup> → B is a function as well, where as usual (f ◦ ϕ)(τ) = f(ϕ(τ)). For instance, let F = x₁ ∧ x₂ and suppose, for all τ ∈ B<sup>V</sup>, the transformation ϕ *flips* x₁, so that ϕ(τ)(x₁) = ¬τ(x₁), and *ignores* x₂, that is, ϕ(τ)(x₂) = τ(x₂). Then F ◦ ϕ is expressed by the formula ¬x₁ ∧ x₂.
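This example can be spelled out concretely. The sketch below is our own illustration (not part of the paper): assignments over V = {x1, x2} are modeled as Python dictionaries, and `compose` is a helper we introduce for f ◦ ϕ.

```python
from itertools import product

def F(tau):                       # the formula F = x1 /\ x2
    return tau["x1"] and tau["x2"]

def phi(tau):                     # flips x1, ignores x2
    return {"x1": not tau["x1"], "x2": tau["x2"]}

def compose(f, t):                # (f o t)(tau) = f(t(tau))
    return lambda tau: f(t(tau))

def G(tau):                       # the formula ~x1 /\ x2
    return (not tau["x1"]) and tau["x2"]

# F o phi agrees with ~x1 /\ x2 on every assignment
assignments = [{"x1": a, "x2": b} for a, b in product([False, True], repeat=2)]
assert all(compose(F, phi)(tau) == G(tau) for tau in assignments)
```

The brute-force check over all four assignments confirms that composing F with the flip-and-ignore transformation yields exactly the function ¬x₁ ∧ x₂.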

Composing a function with a transformation can be seen as a generalization of the application of a partial assignment to a formula or clause as defined in the previous section. Specifically, for a partial assignment σ let σ̂ refer to the following transformation: for any assignment τ, the assignment σ̂(τ) satisfies ⋀_{l∈σ} l, and σ̂ ignores any x ∈ V such that x, ¬x ∉ σ. Then for any formula F the formula F|σ expresses exactly the function F ◦ σ̂. In particular, if α is the partial assignment blocking a clause C then notice (C ◦ α̂)(τ) = 0 for all τ, but α̂ ignores variables not appearing in C; consequently α̂(τ) = τ if τ already falsifies C. Generalizing this idea to transformations that block non-clausal constraints is more complicated. In particular, there may be multiple blocking transformations.

*Example 1.* Let g be the function with g(τ) = 1 if and only if τ(a) ≠ τ(b) (i.e. g is an XOR constraint). Transformations α₁, α₂ are shown in the table below.


Both transformations ignore all x ≠ a, b. Notice if g(τ) = 0 then τ is unaffected by either transformation, and g ◦ α₁(τ) = g ◦ α₂(τ) = 0 for any assignment τ. However α₁ and α₂ are different, so that, for example, if F = ¬a ∧ (b ∨ c) and τ satisfies the literals ¬a, b, and c then F ◦ α₁(τ) = 1 but F ◦ α₂(τ) = 0.

Motivated by this we define transformations blocking a function as follows.

Definition 2. *A transformation* α blocks *a function* g *if* g ◦ α *is unsatisfiable, and for any assignment* τ *if* g(τ )=0 *then* α(τ ) = τ *.*

Notice any g not equal to the constant function 1 has blocking transformations; for example, by mapping every τ satisfying g to a particular assignment falsifying it. Using this definition, the following theorem shows how the redundancy of a Boolean function g with respect to another function f can be demonstrated. This is a direct generalization of Theorem 1, using a transformation blocking g in the place of the partial assignment blocking a clause, and a transformation ω such that g ◦ ω is the constant function 1 in place of the witnessing assignment.

Theorem 2. *Let* f *be a function and* g *a non-constant function. Then* g *is redundant with respect to* f *if and only if there exist transformations* α *and* ω *such that* α *blocks* g*,* g ◦ ω *is the constant function 1, and further* f ◦ α ⊨ f ◦ ω*.*

*Proof.* (⇒) Suppose g is redundant with respect to f and let α be any transformation blocking g. If f is unsatisfiable then f ◦ α is as well, so that f ◦ α ⊨ f ◦ ω holds for any ω. Thus we can take as ω the transformation ω(τ) = τ* for all τ ∈ B<sup>V</sup>, where τ* is some assignment satisfying g. If instead f is satisfiable, by redundancy so is f ∧ g. Here we can take as ω the transformation ω(τ) = τ* for all τ ∈ B<sup>V</sup>, where τ* is some assignment satisfying f ∧ g. Then both f ◦ ω and g ◦ ω are the constant function 1, so that f ◦ α ⊨ f ◦ ω holds in this case as well.

(⇐) Suppose α, ω meet the criteria stated in the theorem. We show that g is redundant by demonstrating that if f is satisfiable, then so is f ∧ g. Suppose τ is an assignment satisfying f. If also g(τ) = 1, then of course τ satisfies f ∧ g. If instead g(τ) = 0, then α(τ) = τ as α blocks the function g. Thus (f ◦ α)(τ) = f(α(τ)) = f(τ) = 1. As f ◦ α ⊨ f ◦ ω, this means f(ω(τ)) = 1. As g ◦ ω is the constant function 1, also g(ω(τ)) = 1, so ω(τ) satisfies f ∧ g.
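The constructions used in this proof can be checked by brute force on a small example. The sketch below is our own illustration: f = a ∨ b and g = a XOR b are arbitrary choices, α and ω are built exactly as in the (⇒) direction, and all three conditions of Theorem 2 are verified over the four assignments.

```python
from itertools import product

ASSIGNMENTS = list(product([0, 1], repeat=2))        # assignments to (a, b)

f = lambda t: bool(t[0] or t[1])                     # f = a \/ b
g = lambda t: t[0] != t[1]                           # g = a XOR b

tau_block = next(t for t in ASSIGNMENTS if not g(t))          # falsifies g
tau_wit   = next(t for t in ASSIGNMENTS if f(t) and g(t))     # satisfies f /\ g

alpha = lambda t: t if not g(t) else tau_block       # alpha blocks g
omega = lambda t: tau_wit                            # g o omega == 1

assert all(not g(alpha(t)) for t in ASSIGNMENTS)              # g o alpha unsatisfiable
assert all(alpha(t) == t for t in ASSIGNMENTS if not g(t))    # identity where g(t) = 0
assert all(g(omega(t)) for t in ASSIGNMENTS)                  # g o omega is constant 1
assert all(not f(alpha(t)) or f(omega(t)) for t in ASSIGNMENTS)  # f o alpha |= f o omega
```

Here both f and f ∧ g are satisfiable, so g is redundant with respect to f, and the assertions confirm the witnessing transformations behave as the theorem requires.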

The clausal characterization in Theorem 1 shows that the redundancy of a clause can be evidenced by providing a witnessing assignment and demonstrating that an implication holds, providing a foundation for refutations based on the iterative conjunction of clauses. Theorem 2 above shows that the redundancy of a function in general can be seen in the same way by providing transformations α and ω. Consequently this suggests how to construct refutations based on the iterative conjunction of Boolean functions.

Definition 3. *A sequence* σ = (g₁, α₁, ω₁),...,(gₙ, αₙ, ωₙ) *is a* redundancy sequence *for a Boolean function* f *if:*

*1.* αₖ *blocks* gₖ *and* gₖ ◦ ωₖ *is the constant function 1, for all* 1 ≤ k ≤ n*;*
*2.* (f ∧ ⋀_{i=1}^{k−1} gᵢ) ◦ αₖ ⊨ (f ∧ ⋀_{i=1}^{k−1} gᵢ) ◦ ωₖ*, for all* 1 ≤ k ≤ n*.*

As for clausal redundancy, refutations are intuitively based on the following: if g<sup>1</sup> is redundant with respect to f, and g<sup>2</sup> is redundant with respect to f ∧ g1, then f and f ∧g1∧g<sup>2</sup> are equisatisfiable; that is, g1∧g<sup>2</sup> is redundant with respect to f. The following holds as a direct consequence.

Proposition 1. *Let* f *be a Boolean function. If* (g₁, α₁, ω₁),...,(gₙ, αₙ, ωₙ) *is a redundancy sequence for* f*, and* f ∧ ⋀_{i=1}^{n} gᵢ *is unsatisfiable, then so is* f*.*

This shows, abstractly, how redundant Boolean functions can be used as a basis for refutations in the same way as redundant clauses. To define practical, and polynomially-checkable, refutation systems based on non-clausal redundancy in this way, we focus on a representation of Boolean functions that can be used within the framework described above. Specifically, we consider sets of BDDs in conjunction, just as formulas are sets of clauses in conjunction. Clauses are easily expressed by BDDs, and thus this representation easily expresses (CNF) formulas; this is necessary as we are typically interested in proving the unsatisfiability not of functions in general, but of (CNF) formulas. It is important to notice this is only a particular instantiation of Theorem 2, and that other representations of Boolean functions may give rise to useful and efficient systems as well.

BDDs [3,13,49] are compact expressions of Boolean functions in the form of rooted, directed, acyclic graphs consisting of *decision nodes*, each labeled by a variable x ∈ V and having two children, and two *terminal nodes*, labeled by 0 and 1. The BDD for a function f : B<sup>V</sup> → B is based on its *Shannon expansion*,

$$f = (\neg x \land f \circ \hat{\sigma}\_0) \lor (x \land f \circ \hat{\sigma}\_1)$$

where σ₀ = {¬x} and σ₁ = {x}, for x ∈ V. As is common we assume BDDs are *ordered* and *reduced*: if a node with variable label x precedes a node with label y in the graph then x ≺ y, and the graph has no distinct, isomorphic subgraphs. Representation in this way is canonical up to variable order, so that no two distinct BDDs with the same variable order represent the same Boolean function [13].
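The Shannon expansion is easy to sanity-check extensionally. In the sketch below (our own illustration, not from the paper), Boolean functions are plain Python callables over assignment dictionaries rather than BDDs, and cofactors are taken by overriding one variable.

```python
from itertools import product

V = ["x", "y", "z"]
f = lambda t: (t["x"] and t["y"]) or t["z"]   # an arbitrary example function

def cofactor(f, var, val):                    # f o sigma_hat for sigma = {var = val}
    return lambda t: f({**t, var: val})

# Shannon expansion on x: f = (~x /\ f|~x) \/ (x /\ f|x)
shannon = lambda t: ((not t["x"]) and cofactor(f, "x", False)(t)) or \
                    (t["x"] and cofactor(f, "x", True)(t))

assignments = [dict(zip(V, bits)) for bits in product([False, True], repeat=3)]
assert all(f(t) == shannon(t) for t in assignments)
```

The assertion confirms the expansion identity on all eight assignments; a BDD package applies the same decomposition recursively, one variable per decision node.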

Our use of BDDs for representing non-clausal redundancy relies on the concept of *cofactors* as developed in the BDD literature. The functions f ◦ σ̂₀ and f ◦ σ̂₁ are called *literal cofactors* of f by ¬x and x, respectively, and are usually written f|¬x and f|x. The cofactor of f by a conjunction of literals c = l₁ ∧ ··· ∧ lₙ can be defined similarly, so that f|c = f ◦ σ̂c, for the partial assignment σc = {l₁,...,lₙ}. This notation is the same as for the application of a partial assignment to a clause or formula from Section 2, as the notions coincide. More precisely, if a formula F and a BDD f express the same function, so do the formula F|σc and the BDD f|c.

More broadly, for BDDs f and g, a *generalized cofactor* of f by g is a BDD h such that f ∧ g = h ∧ g; that is, f and h agree on all assignments satisfying g. This leaves unspecified what value h(τ) should take when g(τ) = 0, and various different BDD operations have been developed for constructing generalized cofactors [20,21,22]. The *constrain* operation [21] produces for f and g, with g not equal to the always false 0 BDD, a generalized cofactor which can be seen as the composition f ◦ πg, where πg is the transformation [63]:

$$
\pi\_g(\tau) = \begin{cases}
\tau & \text{if } g(\tau) = 1 \\
\arg\min\_{\{\tau' \mid g(\tau') = 1\}} d(\tau, \tau') & \text{otherwise.}
\end{cases}
$$

The function d is defined as follows: $d(\tau, \tau') = \sum_{i=1}^{n} |\tau(x_i) - \tau'(x_i)| \cdot 2^{n-i}$, where V = {x₁,...,xₙ} with x₁ ≺ ··· ≺ xₙ. Intuitively, d is a measure of distance between two assignments based on the variables on which they disagree, weighted by their position in the variable order. It is important to notice then that the transformation πg and the resulting f ◦ πg depend on the variable order, and may differ for distinct orders. For a conjunction of literals c, though, f ◦ πc = f|c regardless of the order, so that f|g refers to f ◦ πg in general.
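The transformation πg and the distance d can be emulated by brute force over all assignments. The sketch below is our own illustration (three variables with x1 ≺ x2 ≺ x3, and an arbitrary function f); it also checks the claim that f ◦ πc coincides with the literal cofactor f|c when c is a single literal.

```python
from itertools import product

N = 3
ASSIGNMENTS = list(product([0, 1], repeat=N))   # assignments to x1, x2, x3

def d(t, u):                                    # variable x_i carries weight 2^(n-i)
    return sum(abs(t[i] - u[i]) * 2 ** (N - 1 - i) for i in range(N))

def pi(g):
    """The constrain transformation pi_g, by brute-force nearest-neighbour search."""
    sat = [t for t in ASSIGNMENTS if g(t)]
    return lambda t: t if g(t) else min(sat, key=lambda u: d(t, u))

f = lambda t: bool(t[0] or (t[1] and not t[2]))
c = lambda t: t[1] == 1                         # the single literal x2
p = pi(c)

f_cofactor = lambda t: f((t[0], 1, t[2]))       # the literal cofactor f|x2
assert all(f(p(t)) == f_cofactor(t) for t in ASSIGNMENTS)
```

For an assignment falsifying x2, the d-nearest satisfying assignment simply flips x2 and leaves all other variables untouched, which is exactly why f ◦ πc agrees with the ordinary cofactor for literal conjunctions.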

As the transformation πg maps an assignment falsifying the function g to the nearest assignment (with respect to d) satisfying it, a transformation that blocks the function g can readily be obtained as follows.

Lemma 1. *If* g *is not equal to the constant function 1 then* π¬g *blocks* g*.*

This form of generalized cofactor, as computed by the constrain operation, is well suited for use in redundancy-based reasoning as described above, as the transformation π¬g depends only on g. As a consequence, for BDDs f₁ and f₂ in fact (f₁ ∧ f₂)|¬g ≡ f₁|¬g ∧ f₂|¬g; that is, the BDD (f₁ ∧ f₂)|¬g expresses the same function as the BDD for the conjunction f₁|¬g ∧ f₂|¬g. Thus given a set of BDDs f₁,...,fₙ we can represent (f₁ ∧ ··· ∧ fₙ)|¬g simply by the set of cofactors fᵢ|¬g and without constructing the BDD for the conjunction f₁ ∧ ··· ∧ fₙ, which is NP-hard in general. In particular, given a formula F = C₁ ∧ ··· ∧ Cₙ and a Boolean constraint g, the function F|¬g can be represented simply by applying the constrain operation to each of the BDDs representing the clauses Cᵢ. Therefore, from Theorem 2 we can characterize redundancy for conjunctions of BDDs, written RBDD, as follows.

Proposition 2. *Suppose* f₁,...,fₙ *are BDDs and* g *is a non-constant BDD. If there is a partial assignment* {l₁,...,lₖ} *such that for* ω = l₁ ∧ ··· ∧ lₖ*,*

$$f\_1 |\_{\neg g} \wedge \dots \wedge f\_n |\_{\neg g} \vDash f\_1 |\_{\omega} \wedge \dots \wedge f\_n |\_{\omega}$$

*and* g|ω = 1*, then* g *is redundant with respect to* f₁ ∧ ··· ∧ fₙ*.*

#### 4 BDD Redundancy Properties

The previous section provided a characterization of redundancy for Boolean functions, and showed how this could be instantiated for BDDs. In this section we develop polynomially-checkable properties for showing that a BDD is redundant with respect to a conjunction of BDDs, and describe their use in refutation systems for proving the unsatisfiability of formulas.


Fig. 2: A procedure for unit propagation over a set of BDDs

As Theorem 1 is used for defining clausal redundancy properties, Proposition 2 gives rise to BDD redundancy properties by replacing ⊨ with polynomially decidable relations. Similar to the use of the unit propagation procedure by the clausal properties RUP and PR, we describe a unit propagation procedure for use with a set of BDDs and derive analogous properties RUPBDD and PRBDD.

For a BDD f, the Shannon expansion shows that if f|¬l = 0 (i.e. f|¬l is the always false 0 BDD) for some literal l, then f = l ∧ f|l, and therefore f ⊨ l. Then the *units implied by* f, written U(f), can be defined as follows.

$$\text{Definition 4. } U(f) = \{ l \mid \text{var}(l) \in V \text{ and } f|\_{\neg l} = 0 \}, \text{ for } f: B^V \to B.$$

As f|¬l can be computed in O(|f|), where |f| is the number of nodes in the BDD for f [59], U(f) can certainly be computed in O(|V| · |f|) ⊆ O(|f|<sup>2</sup>), though this can be reduced to O(|f|). We write ⋀U(f) to mean ⋀_{l∈U(f)} l.

Figure 2 provides a sketch of the unit propagation procedure. Whenever U(f) is non-empty for some f in a set of BDDs, each BDD in the set can be replaced with its cofactor by ⋀U(f). This approach to unit propagation is largely similar to that of Olivo and Emerson [53], except we consider two conflict situations: if some BDD becomes 0, or if two BDDs are the negations of each other.

For N = |f₁| + ··· + |fₙ| the procedure UnitProp(f₁,...,fₙ) can be performed in time O(N<sup>2</sup>). In line 5, if fⱼ and ⋀U(fᵢ) share no variables, then fⱼ = fⱼ|⋀U(fᵢ); otherwise the BDD for fⱼ|⋀U(fᵢ) can be constructed in time O(|fⱼ|), and further |fⱼ|⋀U(fᵢ)| < |fⱼ|. This procedure is correct: "conflict" is only returned when ⋀_{i=1}^{n} fᵢ is unsatisfiable (see the extended paper for the proof).

Proposition 3. *If* UnitProp(f₁,...,fₙ) *returns "conflict" then* f₁ ∧ ··· ∧ fₙ ≡ 0*.*
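A brute-force emulation can make the procedure of Figure 2 concrete. The sketch below is our own illustration: each "BDD" is modeled extensionally as its set of satisfying assignments, so this version is exponential where the real procedure is polynomial, but it exercises both conflict conditions (a BDD becoming 0, and two BDDs being negations of each other).

```python
from itertools import product

N = 3
ALL = set(product([0, 1], repeat=N))          # all assignments to x1..x3

def cofactor(f, i, val):
    """Restrict f (a set of satisfying assignments) by the literal x_i = val."""
    return {t for t in ALL if t[:i] + (val,) + t[i+1:] in f}

def units(f):
    """U(f): literals (i, v) such that f is non-0 but f restricted by x_i = 1-v is 0."""
    return {(i, v) for i in range(N) for v in (0, 1)
            if f and not cofactor(f, i, 1 - v)}

def unit_prop(fs):
    fs = list(fs)
    while True:
        if any(not f for f in fs):                          # some BDD became 0
            return "conflict"
        if any(fs[i] == ALL - fs[j] for i in range(len(fs))
               for j in range(len(fs)) if i != j):          # two BDDs are negations
            return "conflict"
        lits = set().union(*(units(f) for f in fs))
        if not lits:
            return fs
        for (i, v) in lits:                                 # cofactor every BDD
            fs = [cofactor(h, i, v) for h in fs]

# x1, (~x1 \/ x2), ~x2 propagate to a conflict:
x1  = {t for t in ALL if t[0]}
c   = {t for t in ALL if not t[0] or t[1]}
nx2 = {t for t in ALL if not t[1]}
assert unit_prop([x1, c, nx2]) == "conflict"
assert unit_prop([x1, c]) != "conflict"
```

Each round eliminates at least one variable from every set, so the loop terminates; the first example derives the units x1 and then x2, after which the third set becomes empty.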

UnitProp generalizes the usual unit propagation procedure on a formula: if C is a clause, then U(C) ≠ ∅ implies C is a unit clause and ⋀_{l∈U(C)} l = C. We extend the relation ⊢₁ and the definition of RUP accordingly.

Definition 5. *Let* f₁,...,fₙ *and* g ≠ 0 *be BDDs. Then* f₁ ∧ ··· ∧ fₙ *implies* g *by* RUPBDD *if* UnitProp(f₁|¬g,...,fₙ|¬g) *returns "conflict."*

*Example 2.* Let F = {C₁ = b ∨ c, C₂ = a ∨ b, C₃ = a ∨ c}, and assume a ≺ b ≺ c. Consider g as shown in Figure 3, expressing the cardinality constraint g(τ) = 1 if and only if τ satisfies at least two of a, b, c; also written {a, b, c} ≥ 2. Figure 3 shows the updates made throughout UnitProp(C₁|¬g, C₂|¬g, C₃|¬g). Notice that U(C₁|¬g) = {¬a}, and U((C₂|¬g)|¬a) = {b}. Then C₃|¬g after cofactoring by ¬a and b becomes the constant BDD 0, so the procedure returns "conflict." As a result, F implies the BDD g by RUPBDD.

Fig. 3: Example derivation of a constraint g, shown in (a), using RUPBDD. In (b), the top line shows the BDDs for each of the clauses (b ∨ c), (a ∨ c), (a ∨ b) after cofactoring by ¬g. The second line shows each of these BDDs after cofactoring by the unit ¬a ∈ U((b ∨ c)|¬g). Here, the middle BDD becomes simply the unit b, and the third line shows each BDD cofactored by the unit b. In this line, the third BDD has become 0, so a conflict is returned.

We show that RUPBDD is a redundancy property. Given BDDs f₁,...,fₙ, g, checking whether g is implied by RUPBDD primarily consists of the UnitProp procedure, though each fᵢ|¬g must first be constructed, which can be done in time O(|fᵢ| · |g|) [21]. The size of this BDD may in some cases be larger than the size of fᵢ, though it is typically smaller [21,63] and at worst |fᵢ|¬g| ≤ |fᵢ| · |g|. Consequently it can be decided in time O(|g|<sup>2</sup> · N<sup>2</sup>) whether g is implied by RUPBDD. Finally if g is implied by RUPBDD then it is redundant with respect to f₁ ∧ ··· ∧ fₙ; in fact, it is a logical consequence (a proof of the following is available in the extended paper).

Proposition 4. *If* f₁ ∧ ··· ∧ fₙ ⊢₁ g*, then* f₁ ∧ ··· ∧ fₙ ⊨ g*.*

From RUPBDD the property PR can be directly generalized to this setting as well. Specifically, we define the redundancy property PRBDD as follows.

Definition 6. *Suppose* f₁,...,fₙ *are BDDs and* g *is a non-constant BDD. Then* g *is* PRBDD *with respect to* ⋀_{i=1}^{n} fᵢ *if there is a partial assignment* {l₁,...,lₖ} *such that* g|ω = 1 *and* ⋀_{i=1}^{n} fᵢ|¬g ⊢₁ fⱼ|ω *for all* 1 ≤ j ≤ n*, where* ω = l₁ ∧ ··· ∧ lₖ*.*

Proposition 2 shows that if g is PRBDD with respect to f = f₁ ∧ ··· ∧ fₙ then g is redundant with respect to f; thus PRBDD is a redundancy property.

Notice these properties and derivations directly generalize their clausal equivalents; for example, if C is PR with respect to a formula F, then (the BDD expressing) C is PRBDD with respect to (the set of BDDs expressing) F. Deciding whether a clause C is PR with respect to a formula F is NP-complete [37]. As PRBDD generalizes PR, PRBDD is NP-hard as well. Further, checking whether g is PRBDD with respect to f₁ ∧ ··· ∧ fₙ by some candidate ω can be done polynomially as argued above, thus the following holds.

Proposition 5. *Deciding whether* g *is* PRBDD *with respect to* f₁ ∧ ··· ∧ fₙ*, given the BDDs* g, f₁,...,fₙ*, is NP-complete.*

In other words, the decision problems for PR and PRBDD are of equal complexity.

The properties RUPBDD and PRBDD as defined in this section can be used to show that a BDD can be added to a set of BDDs in a satisfiability-preserving way. Of course, any clause has a straightforward and simple representation as a BDD, so that a formula can be easily represented this way as a set of BDDs. As a result RUPBDD and PRBDD can be used as systems for refuting unsatisfiable formulas. In the following, we identify a clause with its representation as a BDD, and a formula with its representation as a set of such BDDs.

To simplify the presentation of derivations based on RUPBDD and PRBDD we introduce an additional redundancy property, allowing derivations to include steps to directly derive certain BDDs *path-wise* in the following way.

Definition 7. f₁ ∧ ··· ∧ fₙ *implies* g *by* RUPpath *if* (1) f₁ ∧ ··· ∧ fₙ ⊢₁ ¬c *for every* c = l₁ ∧ ··· ∧ lₘ *such that* l₁,...,lₘ *is a path from the root of* g *to the 0 terminal, and* (2) |g| ≤ log₂(|f₁| + ··· + |fₙ|)*.*

If f₁ ∧ ··· ∧ fₙ implies g by RUPpath then g is a logical consequence of f₁ ∧ ··· ∧ fₙ, as this checks that no assignment satisfies both ¬g and f₁ ∧ ··· ∧ fₙ. The number of paths in a BDD g can however be exponential in |g|, as in the BDD for an XOR constraint, so the second condition ensures RUPpath is polynomially checkable.

The property RUPpath is primarily useful as it allows the derivation of a BDD g whose representation as a set of clauses is included in {f₁,...,fₙ}: if c corresponds to a path to 0 in g, the clause ¬c is included in the direct clausal translation of g. In this context, the restrictive condition (2) in Definition 7 can in fact be removed, since the number of paths in g is then at most n.

Definition 8. *A sequence of BDDs* g₁,...,gₙ *is a* RUPBDD derivation *from a formula* F *if* F ∧ ⋀_{i=1}^{k−1} gᵢ *implies* gₖ *by* RUPBDD*, or by* RUPpath*, for all* 1 ≤ k ≤ n*. A sequence of BDD and assignment pairs* (g₁, ω₁),...,(gₙ, ωₙ) *is a* PRBDD derivation *from a formula* F *if* F ∧ ⋀_{i=1}^{k−1} gᵢ *implies* gₖ *by* RUPpath*, or* ωₖ *is a* PRBDD*-witness for* gₖ *with respect to* F ∧ ⋀_{i=1}^{k−1} gᵢ*, for all* 1 ≤ k ≤ n*.*

As RUPBDD, RUPpath, and PRBDD are redundancy properties, any RUPBDD or PRBDD derivation corresponds to a redundancy sequence of the same length.

*Example 3.* Consider the formula F = {a ∨ b, a ∨ c, b ∨ c, a ∨ d, b ∨ d, c ∨ d} and let g be the BDD such that g(τ) = 1 if and only if τ satisfies at least 3 of a, b, c, d; that is, g is the cardinality constraint {a, b, c, d} ≥ 3. As seen in Example 2, the constraint g₁ = {a, b, c} ≥ 2 is RUPBDD with respect to F; similarly so are the constraints g₂ = {a, c, d} ≥ 2 and g₃ = {b, c, d} ≥ 2. Now, ¬a ∈ U(g₃|¬g): for any τ the assignment π¬g(τ) satisfies at most 2 of a, b, c, d, and if a is one of them then π¬g(τ) surely falsifies g₃. As a result, (g₃|¬g)|a = 0. In a similar way ¬b ∈ U(g₂|¬g). Since g₁|¬g cofactored by the units ¬a and ¬b is falsified, UnitProp(g₁|¬g, g₂|¬g, g₃|¬g) returns "conflict." Consequently g is RUPBDD with respect to F ∧ g₁ ∧ g₂ ∧ g₃, and g₁, g₂, g₃, g is a RUPBDD derivation from F.

This example can be generalized to show that RUPBDD is capable of expressing an inference rule for cardinality constraints called the *diagonal sum* [40]. For L = {l₁,...,lₙ} let Lᵢ = L \ {lᵢ}; the diagonal sum derives L ≥ k + 1 from the set of all n constraints Lᵢ ≥ k.
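The diagonal sum rule can be validated semantically by brute force. The sketch below is our own illustration (positive literals only, with n = 4 and k = 2): every assignment satisfying all four constraints Lᵢ ≥ k also satisfies L ≥ k + 1.

```python
from itertools import product

n, k = 4, 2

def card_at_least(t, idxs, bound):
    # the cardinality constraint: at least `bound` of the (positive)
    # literals with indices in idxs are satisfied by assignment t
    return sum(t[i] for i in idxs) >= bound

for t in product([0, 1], repeat=n):
    premises = all(card_at_least(t, [j for j in range(n) if j != i], k)
                   for i in range(n))
    if premises:
        assert card_at_least(t, range(n), k + 1)
```

Intuitively, if only k literals of L were true, dropping one of the true literals would leave a constraint Lᵢ ≥ k with only k − 1 true literals, contradicting the premises.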

While the properties and refutation systems RUPBDD and PRBDD easily extend their clausal counterparts, it is important to notice that redundancy-based systems using BDDs can be defined in other ways. For instance, say ⋀_{i=1}^{n} fᵢ implies g by IMPpair if fᵢ|¬g ∧ fⱼ|¬g = 0 for some i, j. Then IMPpair is polynomially checkable, by computing the conjunction for each pair i, j. Moreover, it is clear that f₁ ∧ f₂ ⊨ g if and only if f₁ ∧ f₂ implies g by IMPpair. As many logical inference rules have this form, it is possible that systems based on IMPpair are very strong.

#### 5 Gaussian Elimination

Next, we show how the Gaussian elimination technique for simplifying XOR constraints embedded in a formula is captured by the redundancy properties defined in the previous section. Specifically, if an XOR constraint X is derivable from a formula F by Gaussian elimination, we show there is a RUPBDD derivation from F including the BDD expressing X with only a linear size increase.

An *XOR clause* [x₁,...,xₙ]<sup>p</sup> expresses the function f : B<sup>V</sup> → B, where V = {x₁,...,xₙ} and p is 0 or 1, such that f(τ) = 1 if and only if the number of xᵢ ∈ V satisfied by τ is equal modulo 2 to p. In other words, p expresses the parity of the positive literals xᵢ an assignment must satisfy in order to satisfy the XOR clause. As [x, y, y]<sup>p</sup> and [x]<sup>p</sup> express the same function, we assume no variable occurs more than once in an XOR clause. Notice that [ ]<sup>0</sup> expresses the constant function 1, while [ ]<sup>1</sup> expresses 0.

The Gaussian elimination procedure begins by detecting XOR clauses encoded in a formula F. The *direct encoding* D(X) of X = [x₁,...,xₙ]<sup>p</sup> is the collection of clauses of the form C = {l₁,...,lₙ}, where each lᵢ is either xᵢ or ¬xᵢ and the number of negated literals in each C is not equal modulo 2 to p. The formula D(X) expresses the same function as X, containing exactly the clauses ruling out each assignment over the variables in X that does not satisfy X. As a result, D(X) implies the BDD expressing X by RUPpath (see the extended paper for the proof).

Lemma 2. D(X) *implies* X *by* RUPpath*, for* X = [x₁,...,xₙ]<sup>p</sup>*.*
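The direct encoding can be generated and checked by brute force. The sketch below is our own illustration (literals are (variable index, sign) pairs, an encoding we choose for this sketch): D(X) contains one clause for each wrong-parity sign pattern, and agrees with the XOR semantics on every assignment.

```python
from itertools import product

def direct_encoding(n, p):
    """Clauses of D(X) for X = [x1,...,xn]^p; a literal is (index, sign)."""
    clauses = []
    for signs in product([0, 1], repeat=n):
        # the clause with literals (i, signs[i]) is falsified exactly by the
        # assignment t with t[i] = 1 - signs[i]; keep the clause when that
        # assignment has the wrong parity (number of negated literals != p mod 2)
        if sum(1 - s for s in signs) % 2 != p:
            clauses.append(tuple((i, s) for i, s in enumerate(signs)))
    return clauses

def xor_sat(t, p):                 # X is satisfied iff the parity of t equals p
    return sum(t) % 2 == p

def cnf_sat(t, clauses):
    return all(any(t[i] == s for (i, s) in cl) for cl in clauses)

n, p = 3, 1
D = direct_encoding(n, p)
assert len(D) == 2 ** (n - 1)      # the direct encoding has 2^(n-1) clauses
assert all(cnf_sat(t, D) == xor_sat(t, p) for t in product([0, 1], repeat=n))
```

Each clause rules out exactly one wrong-parity assignment, which is the structure Lemma 2 exploits: every path to 0 in the BDD for X corresponds to one of these clauses.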

Similar to the approach of Philipp and Rebola-Pardo [56], we represent Gaussian elimination steps by deriving the addition X ⊕ Y of XOR clauses X = [x₁,...,xₘ, z₁,...,zᵣ]<sup>p</sup> and Y = [y₁,...,yₙ, z₁,...,zᵣ]<sup>q</sup>, given by:

> X ⊕ Y = [x₁,...,xₘ, y₁,...,yₙ]<sup>p⊕q</sup>.
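Operationally, X ⊕ Y is the symmetric difference of the variable sets together with the XOR of the parities, and it is a semantic consequence of X ∧ Y. A brute-force sketch (our own illustration; XOR clauses are modeled as (frozenset of variables, parity) pairs):

```python
from itertools import product

def xor_add(X, Y):                 # X (+) Y: symmetric difference of variables,
    (vx, p), (vy, q) = X, Y        # XOR of parities
    return (vx ^ vy, p ^ q)

def sat(t, X):                     # t: dict mapping variables to 0/1
    vs, p = X
    return sum(t[v] for v in vs) % 2 == p

X = (frozenset({"x", "z"}), 1)     # the XOR clause [x, z]^1
Y = (frozenset({"y", "z"}), 0)     # the XOR clause [y, z]^0
Z = xor_add(X, Y)
assert Z == (frozenset({"x", "y"}), 1)

# X /\ Y |= X (+) Y on every assignment
V = ["x", "y", "z"]
for bits in product([0, 1], repeat=3):
    t = dict(zip(V, bits))
    if sat(t, X) and sat(t, Y):
        assert sat(t, Z)
```

The shared variables z₁,...,zᵣ cancel because each contributes to both sides of the parity sum, which is exactly the elimination step of Gaussian elimination over GF(2).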

The following lemma shows that X ⊕ Y is RUPBDD with respect to X ∧ Y; that is, if a RUPBDD derivation includes X and Y then X ⊕ Y can be derived as well. This is a result of the following observation: while the precise cofactors of X and Y by ¬(X ⊕ Y) depend on the variable order ≺, they are the negations of one another (the proof is included in the extended paper).

Lemma 3. *Let* v *be the* ≺*-greatest variable occurring in exactly one of* X *and* Y*, and assume* v *occurs in* Y*. Then* X|¬(X⊕Y) = X*, and* Y|¬(X⊕Y) = ¬X*.*

The above lemma shows that the procedure UnitProp(X|¬(X⊕Y), Y|¬(X⊕Y)) returns "conflict" immediately, and as a result X ⊕ Y is RUPBDD with respect to f₁ ∧ ··· ∧ fₙ ∧ X ∧ Y for any set of BDDs f₁,...,fₙ.

Define a Gaussian elimination derivation Π from a formula F as a sequence of XOR clauses Π = X₁,...,X_N, such that for all 1 ≤ i ≤ N, either Xᵢ = Xⱼ ⊕ Xₖ for some j, k < i, or D(Xᵢ) ⊆ F. The size of the derivation is $|\Pi| = \sum_{i=1}^{N} s_i$, where sᵢ is the number of variables occurring in Xᵢ. We show that Π corresponds to a RUPBDD derivation with only a linear size increase. This size increase is a result of the fact that the BDD expressing an XOR clause X = [x₁,...,xₙ]<sup>p</sup> has size 2n + 1 (the proof of the following theorem is in the extended paper).

Theorem 3. *Suppose* Π = X₁,...,X_N *is a Gaussian elimination derivation from a formula* F*. Then there is a* RUPBDD *derivation from* F *with size* O(|Π|)*.*

A consequence of this theorem is that RUPBDD includes short refutations for formulas whose unsatisfiability can be shown by Gaussian elimination. More precisely, suppose a formula F includes the direct representations of an unsatisfiable collection of XOR clauses. Then there is a polynomial-length Gaussian elimination derivation of the unsatisfiable XOR clause [ ]<sup>1</sup> from F [62], and by Theorem 3, a polynomial-length RUPBDD derivation of the unsatisfiable BDD 0.

Notably, RUPBDD then includes short refutations of, for example, the Tseitin formulas, for which no polynomial-length refutations exist in the resolution system [64,66]. This limitation of resolution holds as well for the clausal RUP system, without the ability to introduce new variables, as it can be polynomially simulated by resolution [9,25]. As the translation into RUPBDD used to prove Theorem 3 introduces no new variables, this demonstrates the strength of RUPBDD compared to resolution and its clausal analog RUP.

Fig. 4: Usage of the tool dxddcheck, showing an example formula and refutation.

#### 6 Results

To begin to assess the practical usefulness of the systems introduced in Section 4, we have implemented in Python a prototype of a tool called dxddcheck<sup>1</sup> for checking refutations in a subset of RUPBDD. In particular we focus on the result of Section 5, that Gaussian elimination is succinctly captured by RUPBDD.

We ran the SAT solver Lingeling (version bcp) on a collection of crafted unsatisfiable formulas, all of which can be solved using Gaussian elimination. From Lingeling's output we extract a list of XOR clause additions and deletions, ending with the addition of the empty clause, as shown in Figure 4. This list is passed directly to dxddcheck, which carries it out as a DRUPBDD refutation; that is, a RUPBDD refutation also allowing steps which remove or "delete" BDDs from the set. These deletion steps can be removed without affecting the correctness of the refutation, though their inclusion can decrease the time required for checking it, as is the case with DRUP and RUP.
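The checking pass can be sketched as follows; note this is a simplified, hypothetical step format, not dxddcheck's actual input syntax, and it checks XOR additions directly over clause pairs rather than via BDD unit propagation:

```python
# XOR clauses are (frozenset of variables, parity bit), as before.
# A step is ('add', clause, i, j), deriving clause as the XOR of the
# i-th and j-th currently present clauses, or ('del', clause).

def check_xor_refutation(formula, steps):
    """Return True iff the steps validly derive the contradictory
    empty clause []^1 from the formula's XOR clauses."""
    present = list(formula)
    for step in steps:
        if step[0] == 'del':
            present.remove(step[1])       # deletion: drop a present clause
        else:
            _, clause, i, j = step
            (vs, p), (ws, q) = present[i], present[j]
            assert clause == (vs ^ ws, p ^ q), "invalid XOR addition"
            present.append(clause)
    return (frozenset(), 1) in present

# x1 = 1 and x1 = 0 are contradictory; adding them yields []^1.
F = [(frozenset({1}), 1), (frozenset({1}), 0)]
ok = check_xor_refutation(F, [('add', (frozenset(), 1), 0, 1)])
```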


For these experiments we used a 1.8 GHz Intel Core i5 CPU with 8 GB of memory. The table shows the time Lingeling took to solve each formula, the number of lines in the constructed proof and its size, and the time dxddcheck took to construct and check the associated DRUPBDD proof. These benchmarks

<sup>1</sup> Source code is available under the MIT license at http://fmv.jku.at/dxddcheck along with the benchmarks used and our experimental data.

are well-known challenging examples in the context of XOR reasoning and proof production. The rpar\_n formulas are compact, permuted encodings of two contradictory parity constraints on n variables, described by Chew and Heule [18]. The mchess\_n formulas are encodings of the mutilated n × n chessboard problem, as studied by Heule, Kiesl, and Biere [34] as well as Bryant and Heule [14]. The urquhart formulas [17,65] are examples of hard Tseitin formulas.

Lingeling solved each formula by Gaussian elimination almost instantly. We also ran Lingeling and Kissat [11], winner of the main track of the SAT competition in 2020, on the benchmarks with Gaussian elimination disabled, as is required for producing clausal refutations, using an Intel Xeon E5-2620 v4 CPU at 2.10 GHz. Only rpar\_50 was solved within about 10 hours, producing significantly larger proofs; for instance, Kissat produced a refutation of size 6911 MB.

While methods to construct clausal proofs from Gaussian elimination have been proposed, most either lack a public implementation or are limited in scope [18,56]. An exception is the approach very recently proposed by Gocht and Nordström using pseudo-Boolean reasoning [26], with which we are interested in carrying out a thorough comparison in the future.

# 7 Conclusion

We presented a characterization of redundancy for Boolean functions, generalizing the framework of clausal redundancy and efficient clausal proof systems. We showed this can be instantiated to design redundancy properties for functions given by BDDs, and polynomially-checkable refutation systems based on the conjunction of redundant BDDs, including the system PRBDD generalizing the clausal system PR. The system PRBDD also generalizes RUPBDD, which can express Gaussian elimination reasoning without extension variables or clausal translations. The results of a preliminary implementation of a subset of RUPBDD confirm that such refutations are compact and can be efficiently checked.

Examples 2 and 3 show RUPBDD reasoning over cardinality constraints, and we are interested in exploring rules such as *generalized resolution* [39,40]. Other forms of non-clausal reasoning may be possible using BDD-based redundancy systems as well. We are particularly interested in exploring the property IMPpair.

While the system RUPBDD derives only constraints implied by the conjunction of the formula and previously derived constraints, PRBDD is capable of *interference-based* reasoning [30], like its clausal analog PR; there are possibly novel, non-clausal reasoning techniques taking advantage of this ability. Further, RUPBDD and PRBDD are based on the conjunction of BDDs, though Theorem 2 is more general and could be used for other ways of expressing Boolean functions. Finally we are interested in developing an optimized tool for checking proofs in the system PRBDD, as well as a certified proof checker.

Acknowledgements. We extend our thanks to Marijn Heule for his helpful comments on an earlier draft of this paper.

## References





# **Multi-Dimensional Interpretations for Termination of Term Rewriting**

Akihisa Yamada

National Institute of Advanced Industrial Science and Technology, Tokyo, Japan

**Abstract.** Interpretation methods constitute a foundation of termination analysis for term rewriting. From time to time, remarkable instances of interpretation methods have appeared, such as polynomial interpretations, matrix interpretations, arctic interpretations, and their variants. In this paper we introduce a general framework, the multi-dimensional interpretation method, that subsumes these variants as well as many previously unknown interpretation methods as instances. Employing the notion of derivers, we prove the soundness of the proposed method in an elegant way. We implement the proposed method in the termination prover NaTT and verify its significance through experiments.

## **1 Introduction**

Term rewriting [2] is a formalism for reasoning about function definitions or functional programs. For instance, a term rewrite system (TRS) Rfact [7] consisting of the following rewrite rules defines the factorial function:

fact(0) → s(0)    fact(s(x)) → mul(s(x), fact(p(s(x))))    p(s(x)) → x

assuming that s, p, and mul are interpreted respectively as the successor, predecessor, and multiplication functions.
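Read as a recursive program over the naturals, the rules can be sketched in Python; the value of p at 0 is our choice, since the rules only specify p(s(x)):

```python
# R_fact read as a recursive Python program, with s, p, and mul given
# their standard interpretations as successor, predecessor, and
# multiplication. p(0) is unspecified by the rules; we pick 0.

def s(x):
    return x + 1

def p(x):
    return x - 1 if x > 0 else 0   # p(s(x)) -> x

def mul(x, y):
    return x * y

def fact(n):
    if n == 0:
        return s(0)                # fact(0) -> s(0)
    return mul(n, fact(p(n)))      # fact(s(x)) -> mul(s(x), fact(p(s(x))))
```

Termination of `fact` corresponds exactly to the termination question for the TRS analyzed in the rest of the paper.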

Analyzing whether a TRS *terminates*, meaning that the corresponding functional program responds or the function is well defined, has been an active research area for decades. Consequently, several fully automatic termination provers have been developed, e.g., AProVE [10], TTT2 [20], CiME [5], MU-TERM [23], and NaTT [34], and have been competing in the annual Termination Competitions (TermCOMP) [11].

Throughout their history, interpretation methods [25] have been foundational in termination analysis. They are categorized by the choice of well-founded carriers and the class of functions as which symbols are interpreted. *Polynomial interpretations* [22] use the natural numbers N as the carrier, and interpretations are monotone polynomials, i.e., every variable has a coefficient of at least 1. Weakly monotone polynomials, i.e., polynomials allowing zero coefficients, are allowed in the *dependency pair* method [1]. *Negative constants* are allowed using the max operator [15]. General combinations of polynomials and the max operator have been proposed in both the standard [37] and the dependency pair settings [9]. *Negative coefficients* and thus

© The Author(s) 2021. A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 273–290, 2021. https://doi.org/10.1007/978-3-030-79876-5\_16

non-monotone polynomials are also allowed, but in a more elaborate theoretical framework [15,9].

These methods share the common carrier N. In contrast, *matrix interpretations* [16,8] choose vectors over N as the carrier, and interpret symbols as affine maps over it. Although the carrier is generalized, matrix interpretations do not properly generalize polynomial interpretations, since not all polynomials are affine. This gap can be filled by *improved matrix interpretations*, which further generalize the carrier to square matrices [6], so that natural polynomial interpretations can be subsumed by matrix polynomials over 1 × 1 matrices. In *arctic interpretations* [19], the carrier consists of vectors over the arctic naturals (N ∪ {−∞}) or arctic integers (Z ∪ {−∞}), and interpretations are affine maps over it, where affinity is with respect to the *max/plus semiring*.

Having this many variations would be welcome if you are a user of a termination tool in which someone else has already implemented all of them. It is not so welcome if you are the developer of a termination tool who will have to implement all of them. Also, to ultimately trust termination tools, one needs to formalize proof methods using proof assistants and obtain trusted certifiers that validate the outputs of termination tools; see, e.g., the IsaFoR/CeTA [31] and CoLoR/Rainbow [4] frameworks. Although some interpretation methods have already been formalized [28,30], adding the missing variants one by one would cost significant effort.

In this paper, we introduce a general framework for interpretation methods, which subsumes most of the above-mentioned methods as instances, namely (max-)polynomial interpretations (with negative constants), (improved) matrix interpretations, and arctic interpretations, as well as a syntactic method called *argument filtering* [1,21]. Moreover, we obtain a range of previously unexplored interpretation methods as other instances.

After preliminaries, we start with a convenient fact about *reduction pairs*, a central tool in termination proving with dependency pairs (Section 3).

The first step toward the main contribution is the use of *derivers* [24,33], which allow us to abstract away the mathematical details of polynomials or max-polynomials. We will obtain a key soundness result that derivers derive monotone interpretations from monotone interpretations (Section 4).

The second step is to extend derivers to multi-dimensional ones. This setting further generalizes (improved) matrix interpretations, so that max-polynomials, negative constants, and negative entries are allowed (Section 5). It will also be hinted that multi-dimensional derivers can emulate the effect of negative coefficients, although theoretical comparison is left for future work. We also show that our approach subsumes arctic interpretations by adding a treatment for −∞ (Section 6). Although the original formulation by Koprowski and Waldmann [19] has some trickiness, we will show that our simpler formulation is sufficient.

As *strict monotonicity* is crucial for proving termination without dependency pairs, and is still useful with dependency pairs, we will see how to ensure strict monotonicity (Section 7). At this point, the convenient fact we have seen in Section 3 becomes crucial.

Finally, the proposed method is implemented in the termination prover NaTT, and experimental results are reported (Section 8). We evaluate various instances of our method, some corresponding to known interpretation methods and many others not. We choose two new instances to integrate into the NaTT strategy. The new strategy proved the termination of 20 more benchmarks than the old one, and five of them were not proved by any tool in TermCOMP 2020.

## **2 Preliminaries**

We start with *order-sorted* algebras. Let S = ⟨S, ⊑⟩ be a partially ordered set, where elements of S are called *sorts* and ⊑ is called the *subsort relation*. An S*-sorted set* is an S-indexed family A = {A^σ}σ∈S such that σ ⊑ τ implies A^σ ⊆ A^τ. We write A^(σ1,...,σn) for the set A^σ1 × ··· × A^σn. A *sorted map* between S-sorted sets X and A is a mapping f, written f : X → A, such that x ∈ X^σ implies f(x) ∈ A^σ.

An S*-sorted signature* is an S∗ × S-indexed family F = {F(σ⃗, τ)}(σ⃗, τ)∈S∗×S of function symbols.<sup>1</sup> When f ∈ F((σ1,...,σn), τ), we say f has *rank* (σ1,...,σn) → τ and *arity* n in F. We may also view sorted sets and signatures as sets: having a : σ ∈ A means a ∈ A^σ, and f : σ⃗ → τ ∈ F means f ∈ F(σ⃗, τ).

*Example 1.* Consider sort Nat. We define the following {Nat}-sorted signatures:

**–** N := {0 : () → Nat, 1 : () → Nat, 2 : () → Nat, ... }
**–** N\* := N ∪ {\* : (Nat, Nat) → Nat}
**–** N+ := N ∪ {+ : (Nat, Nat) → Nat}
**–** Nmax := N ∪ {max : (Nat, Nat) → Nat}

Let us abbreviate unions of signatures by concatenations of subscripts: for instance, N\*+max denotes N\* ∪ N+ ∪ Nmax. Next consider sorts Neg and Int with Nat ⊑ Int and Neg ⊑ Int. We define the following {Nat, Neg, Int}-sorted signatures:

**–** Z := N ∪ {0 : () → Neg, -1 : () → Neg, -2 : () → Neg, ... }
**–** Z\* := Z ∪ N\* ∪ {\* : (Neg, Neg) → Nat, \* : (Int, Int) → Int}
**–** Z+ := Z ∪ N+ ∪ {+ : (Neg, Neg) → Neg, + : (Int, Int) → Int}
**–** Zmax := Z ∪ Nmax ∪ {max : (Nat, Int) → Nat, max : (Int, Nat) → Nat, max : (Int, Int) → Int}

For an S-sorted signature F, an F*-algebra* ⟨A, [·]⟩ consists of an S-sorted set A called the *carrier* and a family [·] of mappings called the *interpretation*, such that [f] : A^σ⃗ → A^τ whenever f ∈ F(σ⃗, τ).

*Example 2.* We consider the following *standard* interpretation [·]:

$$\begin{aligned} &\cdots \quad [\mathtt{-2}] := -2 \quad [\mathtt{-1}] := -1 \quad [\mathtt{0}] := 0 \quad [\mathtt{1}] := 1 \quad [\mathtt{2}] := 2 \quad \cdots \\ &[\mathtt{*}](a,b) := a \cdot b \qquad [\mathtt{+}](a,b) := a + b \qquad [\mathtt{max}](a,b) := \max(a,b) \end{aligned}$$

Notice that ⟨N, [·]⟩ is an N\*+max-algebra and ⟨Z, [·]⟩ is a Z\*+max-algebra. Here, the {Nat}-sorted set N is defined by N^Nat := N, and the {Nat, Neg, Int}-sorted set Z is defined by Z^Nat := N, Z^Neg := {0, −1, −2, ... }, and Z^Int := Z.

<sup>1</sup> In the literature, sorted signatures are given more assumptions such as monotonicity or regularity. For the purpose of this paper, these assumptions are not necessary.

*Sorted Terms:* Given an S-sorted signature F and an S-sorted set V of *variables*, the <sup>S</sup>-sorted set <sup>T</sup> (F, <sup>V</sup>) of *terms* is inductively defined as follows:

$$\begin{array}{l} - \ v \in \mathcal{T}(\mathcal{F}, \mathcal{V})^{\sigma} \text{ if } v \in \mathcal{V}^{\sigma}; \\ - \ f(s\_1, \dots, s\_n) \in \mathcal{T}(\mathcal{F}, \mathcal{V})^{\rho} \text{ if } f \in \mathcal{F}\_{\vec{\sigma}, \tau}, (s\_1, \dots, s\_n) \in \mathcal{T}(\mathcal{F}, \mathcal{V})^{\vec{\sigma}}, \text{ and } \tau \sqsubseteq \rho. \end{array}$$

An interpretation [·] is extended over terms as follows: given α : V → A, [x]α := α(x) if x ∈ V^σ, and [f(s1,...,sn)]α := [f]([s1]α, ..., [sn]α). The F-algebra ⟨T(F, V), (·)⟩ (which interprets f as the mapping that takes (s1,...,sn) and returns f(s1,...,sn)) is called the *term algebra*, and a sorted map θ : V → T(F, V) is called a *substitution*. The term obtained by replacing every variable x by θ(x) in s is denoted by sθ.
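The extension of an interpretation over terms can be sketched in Python, with variables as strings, compound terms as tuples, and an algebra as a dictionary from symbols to functions (this encoding is ours, for illustration):

```python
# Evaluate a term in an algebra under a variable assignment alpha,
# following the two clauses: [x]_alpha := alpha(x) and
# [f(s1,...,sn)]_alpha := [f]([s1]_alpha, ..., [sn]_alpha).

def interpret(term, algebra, alpha):
    if isinstance(term, str):                     # a variable
        return alpha[term]
    f, *args = term                               # a compound term
    return algebra[f](*(interpret(t, algebra, alpha) for t in args))

# The standard interpretation of Example 2, restricted to a few symbols.
std = {'+': lambda a, b: a + b,
       '*': lambda a, b: a * b,
       'max': max,
       '0': lambda: 0,
       '1': lambda: 1}
```

For instance, the term max(x + 1, 0) is written `('max', ('+', 'x', ('1',)), ('0',))` in this encoding.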

*Term Rewriting:* This paper is concerned with termination analysis for plain term rewriting. In this setting there is only one sort 1, and we may identify a {1}-sorted set A with the set A^1. The set of variables appearing in a term s is denoted by Var(s). A *context* C is a term in which a special variable □ occurs exactly once, and we denote by C[s] the term obtained by substituting s for □ in C. A *rewrite rule* is a pair of terms l and r, written l → r, such that l ∉ V and Var(l) ⊇ Var(r). A *term rewrite system (TRS)* is a set R of rewrite rules, which induces the *root rewrite step* →εR and the *rewrite step* →R as the least relations such that lθ →εR rθ and C[lθ] →R C[rθ] for any rule l → r ∈ R, substitution θ, and context C. A TRS R is *terminating* iff no infinite rewriting s1 →R s2 →R s3 →R ··· is possible.

*The dependency pair (DP) framework [1,14,13]* is a *de facto* standard among automated termination provers for term rewriting. Here we briefly recapitulate its essence. The *root symbol* of a term s = f(s1,...,sn) is f and is denoted by root(s). The set of *defined* symbols in R is DR := {root(l) | l → r ∈ R}. We assume a fresh *marked* symbol f♯ for every f ∈ DR, and write s♯ to denote the term f♯(s1,...,sn) for s = f(s1,...,sn). A *dependency pair* of a TRS R is a rule l♯ → r♯ such that root(r) ∈ DR and l → C[r] ∈ R for some context C. The set of all dependency pairs of R is denoted by DP(R). A *DP problem* ⟨P, R⟩ is just a pair of TRSs.

**Theorem 1 ([1]).** *A TRS* R *is terminating iff the DP problem* ⟨DP(R), R⟩ *is* finite*, i.e., there is no infinite chain* s0 −→DP(R) t0 −→∗R s1 −→DP(R) t1 −→∗R ··· *.*

A number of techniques called *DP processors* that simplify or decompose DP problems are proposed; see [13] for a list of such processors. Among them, the central technique for concluding the finiteness of DP problems is the *reduction pair* processor, which will be reformulated in the next section.

#### **3 Notes on Reduction Pairs**

A reduction pair is a pair ⟨≿, ≻⟩ of order-like relations over terms satisfying some conditions. Here we introduce two formulations of reduction pairs: one demanding natural assumptions on the orderings, and the other, the reduction pair seed, demanding only essential requirements. The first formulation is useful when proving properties of reduction pairs, while the latter is useful when devising new reduction pairs. We will show that the two notions are essentially equivalent: one can always extend a reduction pair seed into a reduction pair in the former sense. Existing formulations of reduction pairs lie strictly in between the two.

**Definition 1 (reduction pair).** *A* (quasi-)order pair ⟨≿, ≻⟩ *is a pair of a quasi-order* ≿ *and an irreflexive relation* ≻ ⊆ ≿ *satisfying* compatibility*:* ≿ ; ≻ ; ≿ ⊆ ≻*. The order pair is* well-founded *if* ≻ *is well-founded.*

*A* reduction pair *is a well-founded order pair* ⟨≿, ≻⟩ *on terms, such that both* ≿ *and* ≻ *are closed under substitutions, and* ≿ *is closed under contexts. Here, a relation* ⊐ *is* closed under substitutions (resp. contexts) *iff* s ⊐ t *implies* sθ ⊐ tθ *for every substitution* θ *(resp.* C[s] ⊐ C[t] *for every context* C*).*
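On finite relations, the conditions of Definition 1 can be checked directly; the following sketch (encoding ours) tests the three-fold compatibility condition ≿ ; ≻ ; ≿ ⊆ ≻ on the standard orders over {0, 1, 2}:

```python
# Finite relations as sets of pairs; relation composition and a direct
# check of the compatibility condition of Definition 1.

def compose(R, S):
    return {(a, c) for (a, b1) in R for (b2, c) in S if b1 == b2}

def compatible(geq, gt):
    """Check geq ; gt ; geq <= gt on finite relations."""
    return compose(compose(geq, gt), geq) <= gt

dom = range(3)
geq = {(a, b) for a in dom for b in dom if a >= b}   # the quasi-order
gt  = {(a, b) for a in dom for b in dom if a > b}    # the strict part
```

On the natural order the condition holds, since a ≥ b > c ≥ d implies a > d; replacing `gt` by an arbitrary irreflexive relation generally breaks it.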

The above formulation of reduction pairs is strictly subsumed by standard definitions (e.g., [1,14,13]), where ≻ is not necessarily a subset of ≿, and compatibility is weakened to either ≿ ; ≻ ⊆ ≻ or ≻ ; ≿ ⊆ ≻. Instead, ≻ is required to be transitive, but this follows from our assumptions ≻ ⊆ ≿ and compatibility: ≻ ; ≻ ⊆ ≻ ; ≿ ⊆ ≻. On one hand, this means that we can safely import existing results on reduction pairs into our formulation.

**Theorem 2 (reduction pair processor [14,13]).** *Let* ⟨P, R⟩ *be a DP problem and* ⟨≿, ≻⟩ *be a reduction pair such that* P ∪ R ⊆ ≿*. Then the DP problem* ⟨P, R⟩ *is finite if and only if* ⟨P \ ≻, R⟩ *is.*

*Example 3.* Consider again the TRS Rfact of the introduction. Proving that Rfact terminates in the DP framework boils down to finding a reduction pair ⟨≿, ≻⟩ satisfying (considering *usable rules* [1]):

$$\mathsf{p}(\mathsf{s}(x)) \succsim x \qquad\qquad \mathsf{fact}^{\sharp}(\mathsf{s}(x)) \succ \mathsf{fact}^{\sharp}(\mathsf{p}(\mathsf{s}(x)))$$

On the other hand, one may wonder whether Definition 1 might be too restrictive. We justify our formulation by uniformly extending general "reduction pairs" into reduction pairs that comply with Definition 1. This is possible for even more general pairs of relations than standard reduction pairs.

**Definition 2 (reduction pair seed).** *A* well-founded order seed *is a pair* ⟨W, S⟩ *of relations such that* S *is well-founded and* S ; W ⊆ S⁺*. A* reduction pair seed *is a well-founded order seed on terms such that both* W *and* S *are closed under substitutions, and* W *is closed under contexts.*

Now we show that every reduction pair seed ⟨W, S⟩ can be extended to a reduction pair ⟨≿, ≻⟩ such that W ⊆ ≿ and S ⊆ ≻. Before that, the assumption S ; W ⊆ S⁺ of Definition 2 is generalized as follows.

**Lemma 1.** *If* ⟨W, S⟩ *is a well-founded order seed, then* S ; W∗ ⊆ S⁺*.*

*Proof.* By induction on the number of W steps. □

**Theorem 3.** *Let* ⟨W, S⟩ *be a well-founded order seed. Then* ⟨≿, ≻⟩ *is a well-founded order pair, where* ≿ := (W ∪ S)∗ *and* ≻ := (W∗ ; S)⁺*.*

*Proof.* It is trivial that ≿ is a quasi-order and ≻ ⊆ ≿ by definition. We show the well-foundedness of ≻ as follows. Suppose on the contrary that we have an infinite sequence:

$$a\_1 \; W^\* \; b\_1 \; S \; a\_2 \; W^\* \; b\_2 \; S \; a\_3 \; W^\* \; b\_3 \; S \; \cdots$$

Then using Lemma 1 (S ; W∗ ⊆ S⁺) we obtain a1 W∗ b1 S⁺ b2 S⁺ b3 ··· , which contradicts the well-foundedness of S.

Now we show compatibility. By definition we have ≿ ; ≻ ⊆ ≻, so it suffices to show ≻ ; ≿ ⊆ ≻. By induction we reduce the claim to ≻ ; (W ∪ S) ⊆ ≻, that is, both ≻ ; W ⊆ ≻ and ≻ ; S ⊆ ≻. Using S ; W ⊆ S⁺ = S ; S∗ we have

$$\begin{aligned} \succ ; W = (W^\*; S)^+ ; W &= (W^\*; S)^\* ; W^\* ; S ; W \\ &\subseteq (W^\*; S)^\* ; W^\* ; S ; S^\* \;\subseteq\; \succ \end{aligned}$$

The other case, ≻ ; S ⊆ ≻, is easy from the definition. □

Now we obtain the following corollary of Theorem 2 and Theorem 3.

**Corollary 1.** *Let* ⟨P, R⟩ *be a DP problem and* ⟨W, S⟩ *a reduction pair seed such that* P ∪ R ⊆ W*. Then* ⟨P, R⟩ *is finite if and only if* ⟨P \ S, R⟩ *is.*

Notice that Definition 2 does not demand any order-like property, most notably transitivity. This is beneficial when developing new reduction pairs; for instance, *higher-order recursive path orders* [17] are known to be non-transitive, but form a reduction pair seed with their reflexive closure. Throughout the paper we use Definition 1, since it provides more useful and natural properties of orderings, which become crucial in Section 7.
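The construction of Theorem 3 can be spot-checked on a small finite seed; the following sketch (encoding ours) computes ≿ = (W ∪ S)∗ and ≻ = (W∗ ; S)⁺ by naive fixpoint iteration and verifies the order-pair conditions:

```python
# Finite relations as sets of pairs; star/plus are naive closure
# computations, adequate for small domains.

def compose(R, S):
    return {(a, c) for (a, b1) in R for (b2, c) in S if b1 == b2}

def plus(R):
    """Transitive closure R^+ by fixpoint iteration."""
    closure = set(R)
    while True:
        nxt = closure | compose(closure, closure)
        if nxt == closure:
            return closure
        closure = nxt

def star(R, dom):
    """Reflexive-transitive closure R^* over the given domain."""
    return plus(R) | {(a, a) for a in dom}

dom = range(4)
W = {(3, 2), (2, 2)}            # weak steps
S = {(2, 1), (1, 0)}            # well-founded strict steps (S;W <= S+ holds)
geq = star(W | S, dom)          # (W ∪ S)*
gt = plus(compose(star(W, dom), S))   # (W*; S)+
```

Note that (3, 2) ends up in `geq` but not in `gt`: weak-only steps never become strict.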

## **4 Interpretation Methods as Derivers**

Interpretation methods construct reduction pairs from F-algebras, where F is the {1}-sorted signature of an input TRS or DP problem, and the carrier is a mathematical structure where a well-founded ordering > is known. In the DP framework, weakly monotone F-algebras play an important role.

**Definition 3 (weakly monotone algebra).** *A mapping* f : A1 × ··· × An → A *is* monotone *with respect to* ⊐ *if* f(a1,...,ai,...,an) ⊐ f(a1,...,a′i,...,an) *whenever* a1 ∈ A1, ..., an ∈ An*,* a′i ∈ Ai*, and* ai ⊐ a′i*. A* weakly monotone F-algebra ⟨A, [·], ≥, >⟩ *consists of an* F*-algebra* ⟨A, [·]⟩ *and an order pair* ⟨≥, >⟩ *such that every* [f] *is monotone with respect to* ≥*.*

*Example 4.* Continuing Example 2, ⟨N, [·], ≥, >⟩ is a weakly monotone N\*+max-algebra with the standard ordering pair ⟨≥, >⟩. Notice that ⟨Z, [·], ≥, >⟩ is not a weakly monotone Z\*+max-algebra, since multiplication on integers is not necessarily monotone. Nevertheless, it is a weakly monotone Z+max ∪ N\*-algebra.

To ease presentation, from now on we assume that F is a {1}-sorted signature, while G is an S-sorted signature. It is easy nevertheless to generalize our results to an arbitrary order-sorted signature F.

**Theorem 4 ([14]).** *Let* ⟨A, [·], ≥, >⟩ *be a weakly monotone* F*-algebra such that* > *is well-founded in* A*. Then* ⟨[≥], [>]⟩ *is a reduction pair on* T(F, V)*, where* s [⊐] t :⇐⇒ ∀α : V → A. [s]α ⊐ [t]α*.*

Moreover, using the term algebra, any reduction pair ⟨≿, ≻⟩ on T(F, V) can be seen as a well-founded F-algebra ⟨T(F, V), (·), ≿, ≻⟩.

*Example 5.* Continuing Example 4, ⟨[≥], [>]⟩ forms a reduction pair for the signature N\*+max. Notice that it does not for Z+max ∪ N\*, essentially because > is not well-founded in Z.

In order to prove the finiteness of a given DP problem, we need a weakly monotone F-algebra for the signature F indicated by this problem, rather than for a predefined signature like N\*+max. We fill the gap by employing the notion of *derivers* [24,33] to derive an F-algebra from one of another signature G.

**Definition 4 (deriver).** *An* F/G*-deriver is a pair* ⟨δ, d⟩ *of a sort* δ ∈ S *and a mapping* d*, such that* d(f) ∈ T(G, {x1 : δ, ..., xn : δ})^δ *when* f *has arity* n *in* F*. Given a* base G*-algebra* ⟨A, [·]⟩*, we define the* derived F*-algebra* ⟨A^δ, d[·]⟩ *by*

$$d[f](a\_1, \ldots, a\_n) := [d(f)](x\_1 \mapsto a\_1, \ldots, x\_n \mapsto a\_n)$$

*Example 6.* Define a {fact♯, p, s : (1) → 1}/Z+max-deriver ⟨Nat, d⟩ by

d(fact♯) := x1    d(s) := x1 + 1    d(p) := max(x1 − 1, 0)

Note that d(p) has sort Nat, thanks to the rank (Int, Nat) → Nat of max in Zmax. The order pair ⟨d[≥], d[>]⟩ satisfies the constraints given in Example 3.
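A numeric spot-check (not a proof) of Example 6 can be run directly: under the derived interpretations, the rule p(s(x)) → x is weakly decreasing and the dependency pair fact♯(s(x)) → fact♯(p(s(x))) strictly decreasing for sampled naturals:

```python
# The deriver of Example 6, written as Python functions on naturals.
d_fact = lambda a: a              # d(fact#) = x1
d_s    = lambda a: a + 1          # d(s)     = x1 + 1
d_p    = lambda a: max(a - 1, 0)  # d(p)     = max(x1 - 1, 0)

# Constraints of Example 3, checked on a sample of the carrier.
weak   = all(d_p(d_s(x)) >= x for x in range(100))
strict = all(d_fact(d_s(x)) > d_fact(d_p(d_s(x))) for x in range(100))
```

Here d_p(d_s(x)) = max(x, 0) = x, so the weak constraint holds with equality, and the strict constraint reduces to x + 1 > x.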

Now we show that an <sup>F</sup>/G-deriver yields a weakly monotone <sup>F</sup>-algebra if the base G-algebra is known to be weakly monotone. Thus, Example 6 proves that Rfact is terminating. The next result about monotonicity is folklore:

**Lemma 2.** *A mapping* f : A^n → A *is monotone with respect to a quasi-order* ≥ *if and only if* a1 ≥ b1, ..., an ≥ bn *implies* f(a1,...,an) ≥ f(b1,...,bn)*.*

*Proof.* The "if" direction is due to the reflexivity of ≥, and the "only if" direction is easy by induction on <sup>n</sup> and the transitivity of <sup>≥</sup>. 

Then monotonicity is carried over to the interpretation of terms, in the following sense. For two sorted maps α : X → A and β : X → A, we write α ≥ β to mean that α(x) ≥ β(x) for every sort σ and every x ∈ X^σ.

**Lemma 3.** *Let* ⟨A, [·], ≥, >⟩ *be a weakly monotone* G*-algebra and* s ∈ T(G, V)^σ*. If* α ≥ β *then* [s]α ≥ [s]β*.*

*Proof.* By structural induction on s. The claim is trivial if s is a variable. Consider <sup>s</sup> <sup>=</sup> <sup>f</sup>(s1,...,sn). We have [si]<sup>α</sup> <sup>≥</sup> [si]<sup>β</sup> for each <sup>i</sup> ∈ {1,...,n} by induction hypothesis. With Lemma 2 and the monotonicity of [f], we conclude:

$$[s]\_\alpha = [f]([s\_1]\_\alpha, \ldots, [s\_n]\_\alpha) \;\geq\; [f]([s\_1]\_\beta, \ldots, [s\_n]\_\beta) = [s]\_\beta \qquad \square$$

**Lemma 4.** *Let* ⟨δ, d⟩ *be an* F/G*-deriver and* ⟨A, [·], ≥, >⟩ *a weakly monotone* G*-algebra. Then* ⟨A^δ, d[·], ≥, >⟩ *is a weakly monotone* F*-algebra.*

*Proof.* Suppose that f has arity n in F, and for every i ∈ {1,...,n} that ai, bi ∈ A^δ and ai ≥ bi. Then from Lemma 3,

$$\begin{aligned} d[f](a\_1, \ldots, a\_n) &= [d(f)](\mathbf{x}\_1 \mapsto a\_1, \ldots, \mathbf{x}\_n \mapsto a\_n) \\ &\ge [d(f)](\mathbf{x}\_1 \mapsto b\_1, \ldots, \mathbf{x}\_n \mapsto b\_n) = d[f](b\_1, \ldots, b\_n) \end{aligned}$$

With Lemma 2 we conclude that every d[f] is monotone with respect to ≥, and hence ⟨A^δ, d[·], ≥, >⟩ is a weakly monotone F-algebra. □

Thus we conclude the soundness of the deriver-based interpretation method:

**Theorem 5.** *If* ⟨δ, d⟩ *is an* F/G*-deriver,* ⟨A, [·], ≥, >⟩ *is a weakly monotone* G*-algebra, and* > *is well-founded in* A^δ*, then* ⟨d[≥], d[>]⟩ *is a reduction pair.*

*Proof.* Immediate consequence of Lemma 4 and Theorem 4. □

It should be clear that Theorem 5 with G = Z+max ∪ N\* subsumes the polynomial interpretation method with negative constants [15, Lemma 4]. Their trick is to turn integers into naturals by applying max(·, 0), as demonstrated syntactically in Example 6. Theorem 5 yields a slightly more general fact: one can mix max and negative constants and still obtain a reduction pair. As far as the author knows, this fact has not been reported elsewhere, although natural max-polynomials without negative constants are known to yield reduction pairs [9, Section 4.1].

In addition, the syntactic technique known as *argument filtering* [1,21] is also a special case of Theorem 5. In the context of higher-order rewriting, Kop and van Raamsdonk generalized argument filters to *argument functions* [18, Definition 7.7], which, in the first-order case, correspond to derivers with G a variant of F. In these applications, base signatures and algebras are not known *a priori*, but are synthesized and analyzed.

## **5 Multi-Dimensional Interpretations**

The *matrix interpretation method* [8] uses a well-founded weakly monotone algebra ⟨N^m, [·]_Mat, ≥≥, ≫⟩ over natural vectors, with an affine interpretation:

$$[f]_{\mathcal{M}at}(\vec{a}_1, \dots, \vec{a}_n) = C_1\vec{a}_1 + \dots + C_n\vec{a}_n + \vec{c}$$

where C_1, …, C_n ∈ N^{m×m} and c ∈ N^m, and the following ordering:

**Definition 5 ([8,19]).** *Given an order pair* ⟨≥, >⟩ *on* A *and a dimension* m ∈ N*, we define the order pair* ⟨≥≥, ≫⟩ *on* A^m *as follows:*

$$(a_1,\dots,a_m) \mathrel{{\ge}{\ge}} (b_1,\dots,b_m) \;:\Longleftrightarrow\; a_1 \ge b_1 \wedge a_2 \ge b_2 \wedge \dots \wedge a_m \ge b_m$$
$$(a_1,\dots,a_m) \gg (b_1,\dots,b_m) \;:\Longleftrightarrow\; a_1 > b_1 \wedge a_2 \ge b_2 \wedge \dots \wedge a_m \ge b_m$$
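For concreteness, the two vector orders of Definition 5 can be sketched in a few lines of Python (our illustration, not part of any cited implementation): the weak order compares all coordinates with ≥, while the strict order additionally requires a strict decrease in the first coordinate.

```python
# Sketch of Definition 5 on integer tuples: the weak order >>= compares
# every coordinate with >=; the strict order >> requires > in the first
# coordinate and >= in the remaining ones.

def weak_ge(a, b):
    """a >>= b : every coordinate of a is >= the corresponding one of b."""
    return all(x >= y for x, y in zip(a, b))

def strict_gt(a, b):
    """a >> b : strict in the first coordinate, weak in the rest."""
    return a[0] > b[0] and weak_ge(a[1:], b[1:])

assert weak_ge((3, 1), (3, 1))
assert not strict_gt((3, 1), (3, 1))   # no strict decrease in coordinate 1
assert strict_gt((4, 1), (3, 1))
assert not strict_gt((4, 0), (3, 1))   # coordinate 2 must still satisfy >=
```

Note that well-foundedness of ≫ only depends on the first coordinate, which is exploited in Theorem 6 below.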

*Improved* matrix interpretations [6] consider square matrices instead of vectors, and thus, in principle, matrix polynomials can be considered. Now we generalize these methods by extending derivers to multi-dimensional ones.

**Definition 6 (multi-dimensional derivers).** *An* m-dimensional F/G-deriver *consists of an* m*-tuple* δ ∈ S^m *of sorts and a mapping* d *such that* d(f) ∈ T(G, X)_δ*, where* X := {x_{i,j} : (δ)_j | i ∈ {1, …, n}, j ∈ {1, …, m}} *if* f *has arity* n *in* F*. Given a* G*-algebra* ⟨A, [·]⟩*, the derived* F*-algebra* ⟨A_δ, d[·]⟩ *is defined by*

$$\vec{d}[f](\vec{a}_1, \dots, \vec{a}_n) := \left(\left[\left(\vec{d}(f)\right)_1\right]_\alpha, \dots, \left[\left(\vec{d}(f)\right)_m\right]_\alpha\right)$$

*where* α *is defined by* α(x_{i,j}) := (a_i)_j*.*

*Example 7 ([8, Example 1]).* The TRS of the single rule <sup>f</sup>(f(x)) <sup>→</sup> <sup>f</sup>(g(f(x))) can be shown terminating by the following 2-dimensional matrix interpretation:

$$[\mathbf{f}]_{\mathcal{M}at}(\vec{a}) = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix} \vec{a} + \begin{pmatrix} 0 \\ 1 \end{pmatrix} \qquad\qquad [\mathbf{g}]_{\mathcal{M}at}(\vec{a}) = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \vec{a} + \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

The 2-dimensional {f, g}/N+-deriver ⟨(Nat, Nat), d⟩ defined by

$$\vec{d}(\mathbf{f}) = \begin{pmatrix} \mathbf{x}_{1,1} + \mathbf{x}_{1,2} \\ 1 \end{pmatrix} \qquad\qquad \vec{d}(\mathbf{g}) = \begin{pmatrix} \mathbf{x}_{1,1} \\ 0 \end{pmatrix}$$

represents [·]_Mat as d⟦·⟧, that is, [≥≥]_Mat = d⟦≥≥⟧ and [≫]_Mat = d⟦≫⟧.
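One can check Example 7 mechanically. The following Python sketch (names are ours, not NaTT's) encodes the two matrix interpretations and verifies, on sample vectors, that the rule f(f(x)) → f(g(f(x))) is strictly decreasing in the first coordinate and weakly decreasing in the second.

```python
# Hypothetical check of Example 7: the 2-dimensional matrix interpretation
# makes f(f(x)) -> f(g(f(x))) strictly decreasing.

def f(a):  # [f](a) = ((1,1),(0,0)) a + (0,1)
    return (a[0] + a[1], 1)

def g(a):  # [g](a) = ((1,0),(0,0)) a + (0,0)
    return (a[0], 0)

def strictly_greater(a, b):  # a >> b: strict in coordinate 1, weak elsewhere
    return a[0] > b[0] and all(x >= y for x, y in zip(a[1:], b[1:]))

for a1 in range(5):
    for a2 in range(5):
        x = (a1, a2)
        lhs = f(f(x))        # interpretation of f(f(x))
        rhs = f(g(f(x)))     # interpretation of f(g(f(x)))
        assert strictly_greater(lhs, rhs)
```

Indeed, lhs evaluates to (a₁ + a₂ + 1, 1) and rhs to (a₁ + a₂, 1), so the decrease is uniform in x.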

Now we prove a counterpart of Theorem 5 for multi-dimensional derivers. The following lemma is one of the main results of this paper, which is somewhat surprisingly easy to prove.

**Lemma 5.** *For an* m*-dimensional* F/G*-deriver* ⟨δ, d⟩ *and a weakly monotone* G*-algebra* ⟨A, [·], ≥, >⟩*,* ⟨A_δ, d[·], ≥≥, ≫⟩ *is a weakly monotone* F*-algebra.*

*Proof.* Let f have arity n in F and a_1, …, a_n, b_1, …, b_n ∈ A_δ satisfy a_i ≥≥ b_i. Define α and β by α(x_{i,j}) := (a_i)_j and β(x_{i,j}) := (b_i)_j. By assumption we have α ≥ β, and with Lemma 3 we have

$$\left(\vec{d}[f](\vec{a}_1, \dots, \vec{a}_n)\right)_j = \left[\left(\vec{d}(f)\right)_j\right]_\alpha \ge \left[\left(\vec{d}(f)\right)_j\right]_\beta = \left(\vec{d}[f](\vec{b}_1, \dots, \vec{b}_n)\right)_j$$

for every j ∈ {1, …, m}. Hence d[f](a_1, …, a_n) ≥≥ d[f](b_1, …, b_n), and this concludes the proof due to Lemma 2. □

**Theorem 6.** *For a multi-dimensional* F/G*-deriver* ⟨δ, d⟩ *and a weakly monotone* G*-algebra* ⟨A, [·], ≥, >⟩ *such that* > *is well-founded in* A_{(δ)_1}*,* ⟨d[≥≥], d[≫]⟩ *is a reduction pair.*

*Proof.* Thanks to Lemma 5 and Theorem 4, it suffices to show that ≫ is well-founded in A_δ. Suppose to the contrary that there exists an infinite sequence a_1 ≫ a_2 ≫ ⋯ with a_1, a_2, … ∈ A_δ. Then we have (a_1)_1 > (a_2)_1 > ⋯ and (a_1)_1, (a_2)_1, … ∈ A_{(δ)_1}, contradicting the well-foundedness of > in A_{(δ)_1}. □

It should be clear that every m-dimensional (improved) matrix interpretation can be expressed as an m-dimensional (resp. m²-dimensional) F/N\*+-deriver. There are two more important consequences of Theorem 6. First, we can interpret symbols as non-affine maps, even including max-polynomials. Second, since > is not required to be well-founded in A_{(δ)_2}, …, A_{(δ)_m}, examples that previously required non-monotone interpretations (and hence a stronger condition than Theorem 2) can be handled.

*Example 8 (Excerpt of* AProVE 08/log*).* Consider the TRS R/ consisting of

$$\begin{aligned} x - \mathbf{0} \to x &\qquad\qquad \mathbf{0} \;/\; y \to \mathbf{0} \\ \mathbf{s}(x) - \mathbf{s}(y) \to x - y &\qquad\qquad \mathbf{s}(x) \;/\; \mathbf{s}(y) \to (\mathbf{s}(x) - \mathbf{s}(y)) \;/\; \mathbf{s}(y) \end{aligned}$$

which defines (for simplicity, rounded up) natural division. Proving R_/ terminating using dependency pairs boils down to finding a reduction pair ⟨⪰, ≻⟩ such that (again considering usable rules)

$$x - \mathbf{0} \succeq x \qquad \mathbf{s}(x) - \mathbf{s}(y) \succeq x - y \qquad \mathbf{s}(x) \;/^\sharp\; \mathbf{s}(y) \succ (\mathbf{s}(x) - \mathbf{s}(y)) \;/^\sharp\; \mathbf{s}(y)$$

A polynomial interpretation [·]_Pol with negative coefficients such that

$$[\mathbf{0}]_{\mathcal{P}ol} = 0 \quad [\mathbf{s}]_{\mathcal{P}ol}(x) = x + 1 \quad [/^\sharp]_{\mathcal{P}ol}(x, y) = x \quad [-]_{\mathcal{P}ol}(x, y) = \max(x - y,\, 0)$$

satisfies the above constraints, but one must validate the requirements of [15, Theorem 11]. In our setting, the F/Z+max-deriver ⟨(Nat, Neg), d⟩ such that

$$\vec{d}(\mathbf{0}) = \begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix} \quad \vec{d}(\mathbf{s}) = \begin{pmatrix} \mathbf{x}_{1,1} + \mathbf{1} \\ \mathbf{x}_{1,2} - \mathbf{1} \end{pmatrix} \quad \vec{d}(-) = \begin{pmatrix} \max(\mathbf{x}_{1,1} + \mathbf{x}_{2,2},\, \mathbf{0}) \\ \mathbf{0} \end{pmatrix} \quad \vec{d}(/^{\sharp}) = \begin{pmatrix} \mathbf{x}_{1,1} \\ \mathbf{0} \end{pmatrix}$$

yields a reduction pair satisfying the above constraints.

The intuition here is that the two-dimensional interpretation of s^n(0) records n in the first coordinate and −n in the second. Hence, one does not have to reconstruct −n from n via the non-monotone minus operation.
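The deriver of Example 8 can be checked numerically. The following sketch (function names are ours) evaluates the four interpretations over the carrier Nat × Neg, i.e., first coordinates range over naturals and second coordinates over non-positive integers, and verifies the three ordering constraints on sample values.

```python
# Sketch of Example 8's 2-dimensional deriver over the carrier Nat x Neg.

def d_zero():
    return (0, 0)

def d_s(a):                      # (x11 + 1, x12 - 1)
    return (a[0] + 1, a[1] - 1)

def d_minus(a, b):               # (max(x11 + x22, 0), 0)
    return (max(a[0] + b[1], 0), 0)

def d_div(a, b):                 # /# : (x11, 0)
    return (a[0], 0)

def weak(a, b):                  # componentwise >=
    return all(x >= y for x, y in zip(a, b))

def strict(a, b):                # strict in the first coordinate only
    return a[0] > b[0] and weak(a[1:], b[1:])

samples = [(n, -k) for n in range(4) for k in range(4)]
for x in samples:
    for y in samples:
        assert weak(d_minus(x, d_zero()), x)                   # x - 0  >=  x
        assert weak(d_minus(d_s(x), d_s(y)), d_minus(x, y))    # s(x)-s(y) >= x-y
        assert strict(d_div(d_s(x), d_s(y)),                   # strict DP constraint
                      d_div(d_minus(d_s(x), d_s(y)), d_s(y)))
```

The strict constraint holds because the second coordinate of s(y) is at most −1, so the first coordinate of the right-hand side is at most x₁, which is strictly below x₁ + 1.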

It seems plausible to the author that negative coefficients can always be eliminated using the above idea; however, increasing the dimension introduces more freedom in the variables (the variable introduced to represent −n may take other values as well), so the resulting ordering over terms may differ. Whether this idea always works is left for future work.

## **6 Arctic Interpretations**

An *arctic interpretation* [19] [·]_A is a matrix interpretation over the *arctic semiring*; that is, every interpretation [f]_A(x_1, …, x_n) is of the form

$$C\_1 \otimes \vec{x}\_1 \oplus \dots \oplus C\_n \otimes \vec{x}\_n \oplus \vec{c} \tag{1}$$

where ⊗ and ⊕ denote matrix multiplication and matrix addition in which scalar addition is replaced by the max operation and scalar multiplication by addition, and the entries of C_i and c are *arctic naturals* (N_{−∞} := N ∪ {−∞}) or *arctic integers* (Z_{−∞} := Z ∪ {−∞}). In addition, (1) must be *absolute positive*: (c)_1 ≥ 0, so that ⟨N × N_{−∞}^{m−1}, [·]_A, ≥≥, ≫⟩ or ⟨N × Z_{−∞}^{m−1}, [·]_A, ≥≥, ≫⟩ forms a well-founded weakly monotone algebra.
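The arctic ("max-plus") operations ⊕ and ⊗ are easy to implement; the following minimal Python sketch (our illustration, using `None` to stand in for −∞) shows how an arctic interpretation C ⊗ x⃗ ⊕ c⃗ is evaluated.

```python
# Minimal arctic (max-plus) matrix operations; None stands in for -infinity.

NEG_INF = None

def a_add(x, y):                       # arctic addition = max
    if x is NEG_INF: return y
    if y is NEG_INF: return x
    return max(x, y)

def a_mul(x, y):                       # arctic multiplication = +
    if x is NEG_INF or y is NEG_INF:
        return NEG_INF
    return x + y

def mat_vec(C, v):                     # C (x) v in the arctic semiring
    result = []
    for row in C:
        acc = NEG_INF
        for c, x in zip(row, v):
            acc = a_add(acc, a_mul(c, x))
        result.append(acc)
    return result

def vec_add(u, v):                     # u (+) v, componentwise max
    return [a_add(x, y) for x, y in zip(u, v)]

# An example interpretation [f](x) = C (x) x (+) c; C and c are made up.
C = [[0, 2], [NEG_INF, 1]]
c = [3, NEG_INF]
print(vec_add(mat_vec(C, [1, 0]), c))  # -> [3, 1]
```

Since the first entry of c above is 3 ≥ 0, this interpretation is absolute positive in the sense just defined.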

The above formulation deviates from the original [19] in two ways. First, we do not introduce the special strict order in which −∞ is considered strictly greater than −∞. Koprowski and Waldmann demanded this to ensure closure under general substitutions, but such a comparison cannot occur here, as we only need to consider substitutions that respect the carrier N × Z_{−∞}^{m−1}. Second, for arctic natural interpretations they relaxed absolute positiveness to *somewhere finiteness*: (c)_1 ≠ −∞ or (C_i)_{1,1} ≠ −∞ for some i. However, the two assumptions turn out to be equivalent.

**Proposition 1.** *An arctic natural interpretation of form* (1) *can be represented in absolute positive form iff it is somewhere finite.*

*Proof.* Clearly, absolute positiveness implies somewhere finiteness. For the other direction, since (c)_1 ≠ −∞ trivially implies absolute positiveness, suppose that (c)_1 = −∞ and (C_i)_{1,1} ≠ −∞ for some i. We then know (y)_1 ≥ 0, where y := C_1 ⊗ x_1 ⊕ ⋯ ⊕ C_n ⊗ x_n. Hence, taking c′ := (0, (c)_2, …, (c)_m), we have [f]_A(x_1, …, x_n) = y ⊕ c′, and this representation is absolute positive. □

One can easily obtain arctic interpretations via multi-dimensional derivers: consider a sort ANat, with Nat a subsort of ANat, and the {Nat, ANat}-sorted signature N+max-oo, extending N+max with


and extend the standard interpretation ⟦·⟧ accordingly. We omit the easy proof of the following fact, as well as the counterpart for arctic integer interpretations.

**Proposition 2.** *Every absolute positive arctic natural interpretation* [·]_A *is represented as* d⟦·⟧ *via an* F/N+max-oo*-deriver* ⟨(Nat, ANat, …, ANat), d⟩*.*

Note that, in practice, this requires us to handle −∞ ourselves, since no standard SMT theory [3] supports arithmetic with −∞.

## **7 Strict Monotonicity**

Before the invention of dependency pairs [1], strictly monotone algebras were necessary for proving termination by interpretation methods, and they constitute a sound and complete method for proving termination of TRSs.

**Definition 7.** *A* strictly *monotone* F*-algebra is a weakly monotone* F*-algebra* ⟨A, [·], ≥, >⟩ *such that* ⟨A, [·]⟩ *is monotone with respect to both* ≥ *and* >*.*

**Theorem 7 (cf. [36]).** *A TRS* R *is terminating if and only if there is a strictly monotone well-founded* F*-algebra* ⟨A, [·], ≥, >⟩ *such that* R ⊆ [>]*.*

Moreover, strict monotonicity is a desirable property in the DP framework as it allows one to remove not only dependency pairs but also rewrite rules.

**Theorem 8 ([12]).** *A DP problem* ⟨P, R⟩ *is finite if* ⟨P \ [>], R \ [>]⟩ *is, where* ⟨A, [·], ≥, >⟩ *is a strictly monotone well-founded* F*-algebra such that* P ∪ R ⊆ [≥]*.*

We now state a criterion that ensures strict monotonicity of multi-dimensional interpretations obtained via derivers. Below we write d_i for the mapping defined by d_i(f) := (d(f))_i.

**Theorem 9.** *Let* ⟨δ, d⟩ *be an* m*-dimensional* F/G*-deriver and* ⟨A, [·], ≥, >⟩ *a weakly monotone* G*-algebra. Suppose that whenever* f *has arity* n *in* F *and* i ∈ {1, …, n}*,* α(x_{i,1}) > a *implies* [d_1(f)]_α > [d_1(f)]_{α(x_{i,1} ↦ a)} *for any* α : X → A *and* a ∈ A*. Then* ⟨A_δ, d[·], ≥≥, ≫⟩ *is a strictly monotone* F*-algebra.*

*Proof.* We only prove strict monotonicity, as we already know weak monotonicity by Lemma 5. So suppose that f has arity n in F, a_1, …, a_i, …, a_n, a′_i ∈ A_δ, and a_i ≫ a′_i. For the first coordinate, define α by α(x_{k,j}) := (a_k)_j. Then, first using the assumption and then Lemma 3, we conclude

$$\begin{aligned} d_1[f](\vec{a}_1, \dots, \vec{a}_i, \dots, \vec{a}_n) &= [d_1(f)]_\alpha \\ &> [d_1(f)]_{\alpha(\mathbf{x}_{i,1} \mapsto (\vec{a}_i')_1)} \\ &\ge [d_1(f)]_{\alpha(\mathbf{x}_{i,1} \mapsto (\vec{a}_i')_1,\, \mathbf{x}_{i,2} \mapsto (\vec{a}_i')_2,\, \dots,\, \mathbf{x}_{i,m} \mapsto (\vec{a}_i')_m)} \\ &= d_1[f](\vec{a}_1, \dots, \vec{a}_i', \dots, \vec{a}_n) \end{aligned}$$

For the other coordinates, thanks to the "new" assumption > ⊆ ≥ in Definition 1, we have a_i ≥≥ a′_i. Then weak monotonicity ensures d[f](a_1, …, a_i, …, a_n) ≥≥ d[f](a_1, …, a′_i, …, a_n), from which we deduce, for each j ∈ {2, …, m},

$$d_j[f](\vec{a}_1, \dots, \vec{a}_i, \dots, \vec{a}_n) \ge d_j[f](\vec{a}_1, \dots, \vec{a}_i', \dots, \vec{a}_n) \qquad\square$$

Although the above result and proof do not look surprising, it is worth noting that the statement is false in the standard formulation, which does not demand > ⊆ ≥ (as even in [8]).

*Example 9.* Consider the following apparently monotone matrix interpretation:

$$[\mathsf{f}]\left(\begin{pmatrix} a_1 \\ a_2 \end{pmatrix}\right) := \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \begin{pmatrix} a_1 \\ a_1 \end{pmatrix}.$$

If one had a_1 > b_1 but a_1 ≱ b_1, then

$$[\mathsf{f}]\left(\begin{pmatrix}a_1\\a_2\end{pmatrix}\right) = \begin{pmatrix}a_1\\a_1\end{pmatrix} \not\gg \begin{pmatrix}b_1\\b_1\end{pmatrix} = [\mathsf{f}]\left(\begin{pmatrix}b_1\\a_2\end{pmatrix}\right) \quad \text{even though} \quad \begin{pmatrix}a_1\\a_2\end{pmatrix} \gg \begin{pmatrix}b_1\\a_2\end{pmatrix}.$$

So [f] would not be monotone with respect to ≫.

## **8 Implementation and Experiments**

Multi-dimensional interpretations are implemented in the termination prover NaTT version 2.0,<sup>2</sup> using a *template*-based approach.

**Definition 8.** *An* m*-dimensional* F/G*-deriver* template ⟨δ, d⟩ *with an* S*-sorted set* W *of* template variables *is defined as in Definition 6, but allowing* d(f) ∈ T(G, W ∪ X)_δ*. Its* instance *according to a substitution* θ : W → T(G, ∅) *is the* F/G*-deriver* ⟨δ, dθ⟩*, defined by* dθ(f) := (d_1(f)θ, …, d_m(f)θ)*.*

In the implementation, we fix G = Z+max ∪ N\* and the base weakly monotone G-algebra ⟨Z, ⟦·⟧, ≥, >⟩. Given an m-dimensional deriver template ⟨δ, d⟩ with W, our task is now to find θ : W → Z such that dθ⟦s⟧ ≥≥ dθ⟦t⟧ for every (s, t) ∈ P ∪ R of the DP problem ⟨P, R⟩ of concern, thanks to Theorem 6. NaTT reduces this problem to an SMT problem and passes it to a backend SMT solver. The page limit does not allow us to detail the reduction; in short, the constraint dθ⟦s⟧ ≥≥ dθ⟦t⟧ is reduced to a Boolean formula over atoms of the form a ∗ v_{1,i_1} ∗ ⋯ ∗ v_{n,i_n} ≥ b ∗ v_{1,i_1} ∗ ⋯ ∗ v_{n,i_n}, where a, b ∈ T(G, W), and v_{1,i_1}, …, v_{n,i_n} ∈ (Var(s) ∪ Var(t)) × {1, …, m} are seen as variables. Internally, NaTT uses a distribution approach [30], whose soundness crucially relies on the fact that the only rank of ∗ is (Nat, Nat) → Nat in the signature G. Each atom is then further reduced to (1) a = b if (δ)_{i_j} = Int for some j, (2) a ≥ b if |{j | (δ)_{i_j} = Neg}| is even, and (3) a ≤ b otherwise. Due to the last step, having coordinates of sort Int leads to a stronger constraint when ordering terms. Finally, the resulting formula, which contains only template variables, is passed to the SMT solver Z3 4.8.10 [26], and a satisfying solution θ : W → Z is the desired substitution.
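The template-and-instantiate idea can be illustrated without an SMT solver. The following toy sketch (entirely ours, not NaTT's reduction) enumerates values for the template variables of one-dimensional "sum" templates [s](x) = b·x + w and [f](x) = b′·x + w′ over a small finite range, and keeps the instances that orient the single rule f(s(x)) → f(x) strictly, checking the constraint on sample natural numbers instead of proving it symbolically.

```python
# Toy template search: brute-force instantiation of affine templates
# in place of the SMT-based search described above.
from itertools import product

def affine(b, w):
    return lambda x: b * x + w

solutions = []
for b_s, w_s, b_f, w_f in product((0, 1), (0, 1, 2), (0, 1), (0, 1, 2)):
    s, f = affine(b_s, w_s), affine(b_f, w_f)
    # check [f(s(x))] > [f(x)] on sample points (a stand-in for a real proof)
    if all(f(s(x)) > f(x) for x in range(20)):
        solutions.append((b_s, w_s, b_f, w_f))

assert (1, 1, 1, 0) in solutions   # [s](x) = x + 1, [f](x) = x works
```

The real implementation keeps the template variables symbolic, so the constraint "for all x" becomes a condition on coefficients that Z3 can solve exactly rather than a finite sample check.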

To verify the practical significance of the method, we evaluated various templates in a simple dependency pair setting. For a function symbol f of arity n ≥ 2, the k-th coordinate of the template d(f) is chosen from

**–** sum: $w + \sum_{i=1}^{n} (b * \mathbf{x}_{i,k})$,

<sup>2</sup> Available at https://www.trs.cm.is.nagoya-u.ac.jp/NaTT/


**Table 1.** Evaluation of 2-dimensional templates.


where b and w introduce fresh template variables, b ranges over {0, 1}, and the sort of w is up to further choice. The sort of the first coordinate is turned into Nat by applying max(·, 0) if necessary.

Experiments were run on the StarExec environment [29] with a timeout of 300 seconds. The benchmarks are the 1507 TRSs from the TRS Standard category of the *termination problem database* 11 [32]. Due to the huge search space, we evaluate templates of dimension up to 2. Part of the results is summarized in Table 1. Full details of the experiments are available at http://www.trs.cm.is.nagoya-u.ac.jp/NaTT/multi/.

In the table, each coordinate is represented by its template and the sort of w. In terms of the number of successful termination proofs, indicated in the "YES" column, the classical matrix interpretations (row #3) are impressively strong. Nevertheless, it is worth considering a negative coordinate (#4), as it gives 10 termination proofs that the previous version of NaTT could not find, indicated in the "New" column. In contrast, allowing whole integers in the second coordinate (#5) does not look promising, as the runtime grows significantly. Concerning "max", we observe that its use in the second coordinate (#6)

<sup>3</sup> This template is a subset of integer max-polynomials [9], although the fact that it yields a reduction pair is new.

<sup>4</sup> In our implementation, negative infinity is not supported. Instead, a similar effect is emulated by zero coefficients.


**Table 2.** Experiments with combined strategies

degrades the performance. Using "max" in both coordinates *à la* arctic interpretations (#8, #9) gives a few new termination proofs, but the impact on the runtime is significant in the current implementation. The runtime improves when some occurrences of "max" are replaced by "sum" (#10–12), while the power does not seem to suffer. In terms of the number of termination proofs, the heuristic choice between "sum-sum" and "max-sum" in the first coordinate (#13) performed best among the evaluated templates.

From these experiments, we picked templates #4 and #13 to incorporate into the NaTT default strategy. The final results are summarized in Table 2. Although the runtime noticeably increases, adding both #4 and #13 gives 20 more examples solved, and five of them (AProVE\_09\_Inductive/log and four in Transformed\_CSR\_04/) were not solved by any tool in TermCOMP 2020.

## **9 Conclusion**

In this paper we introduced a deriver-based multi-dimensional interpretation method. The author expects that this result clarifies the relationships between existing interpretation methods and eases the task of developing and maintaining termination tools. Moreover, it yields many previously unknown interpretation methods as instances, proving the termination of some standard benchmarks that state-of-the-art termination provers could not handle.

A theoretical comparison with negative coefficients is left for future work, and the use of −∞ is not yet implemented. Also, since this work broadens the search space, it would be interesting to search for derivers heuristically rather than fixing templates. Derivers of higher dimensions also seem interesting to explore. Finally, although the proposed method is implemented in the termination prover NaTT, there is no guarantee that the implementation is correct. In order to certify termination proofs that use multi-dimensional derivers, one must formalize the proofs in this paper, extend the certifiable proof format [27], and implement a verified function to validate such proofs.

*Acknowledgments* The author would like to thank Aaron Stump and his team for the StarExec environment, which ran experiments taking 40 days of node time within a day. The author also thanks the anonymous reviewers of previous versions of this paper. This work was partly supported by the Austrian Science Fund (FWF) projects Y757 and P27502, and the Japan Science and Technology Agency (JST) ERATO MMSD project.

# **References**


vol. 5674, pp. 452–468. Springer (2009). https://doi.org/10.1007/978-3-642-03359- 9 31



# **Finding Good Proofs for Description Logic Entailments using Recursive Quality Measures**

Christian Alrabbaa, Franz Baader, Stefan Borgwardt, Patrick Koopmann, and Alisa Kovtunova

Theoretical Computer Science, TU Dresden, Dresden, Germany

**Abstract.** Logic-based approaches to AI have the advantage that their behavior can in principle be explained to a user. If, for instance, a Description Logic reasoner derives a consequence that triggers some action of the overall system, then one can explain such an entailment by presenting a proof of the consequence in an appropriate calculus. How comprehensible such a proof is depends not only on the employed calculus, but also on the properties of the particular proof, such as its overall size, its depth, the complexity of the employed sentences and proof steps, etc. For this reason, we want to determine the complexity of generating proofs that are below a certain threshold w.r.t. a given measure of proof quality. Rather than investigating this problem for a fixed proof calculus and a fixed measure, we aim for general results that hold for wide classes of calculi and measures. In previous work, we first restricted our attention to a setting where proof size is used to measure the quality of a proof. We then extended the approach to a more general setting, but important measures such as proof depth were not covered. In the present paper, we provide results for a class of measures called recursive, which yields lower complexities and also encompasses proof depth. In addition, we close some gaps left open in our previous work, thus providing a comprehensive picture of the complexity landscape.

#### **1 Introduction**

Explainability has developed into a major issue in Artificial Intelligence, particularly in the context of sub-symbolic approaches based on Machine Learning [6]. In contrast, results produced by symbolic approaches based on logical reasoning are "explainable by design" since a derived consequence can be formally justified by showing a proof for it. In practice, things are not that easy since proofs may be very long, and even single proof steps or stated sentences may be hard to comprehend for a user who is not an expert in logic. For this reason, there has been considerable work in the Automated Deduction and Logic in AI communities on how to produce "good" proofs for certain purposes, both for full first-order logic and for decidable logics such as Description Logics (DLs) [9]. We mention here only a few approaches, and refer the reader to the introduction of our previous work [2] for a more detailed review.

First, there is work that transforms proofs that are produced by an automated reasoning system into ones in a calculus that is deemed to be more appropriate for human consumption [11, 22, 23]. Second, abstraction techniques are used to reduce the size of proofs by introducing definitions, lemmas, and more abstract deduction rules [16, 17]. Justification-based explanations for DLs [10, 14, 28] can be seen as a radical abstraction technique where the abstracted proof consists of a single proof step, from a minimal set of stated sentences that implies a certain consequence directly to this consequence. Finally, instead of presenting proofs in a formal, logical syntax, one can also try to increase readability by translating them into natural language text [12, 25–27] or visualizing them [5].

The purpose of this work is of a more (complexity) theoretic nature. We want to investigate how hard it is to find good proofs, where the quality of a proof is described by a measure m that assigns non-negative rational numbers to proofs. More precisely, as usual, we investigate the complexity of the corresponding decision problem, i.e., the problem of deciding whether there is a proof P with m(P) ≤ *q* for a given rational number *q*. In order to abstract from specific logics and proof calculi, we develop a general framework in which proofs are represented as labeled, directed hypergraphs, whose hyperedges correspond to single sound derivation steps. To separate the complexity of generating good proofs from the complexity of reasoning in the underlying logic, we introduce the notion of a *deriver*, which generates a so-called *derivation structure*. This structure consists of possible proof steps, from which all proofs of the given consequence can be constructed. Basically, such a derivation structure can be seen as consisting of all relevant instantiations of the rules of a calculus that can be used to derive the consequence. We restrict our attention to decidable logics and consider derivers that produce derivation structures of polynomial or exponential size. Examples of such derivers are consequence-based reasoners for the DLs EL [7, 21] and ELI [9, 18], respectively. In our complexity results, the derivation structure is assumed to be already computed by the deriver,<sup>1</sup> i.e., the complexity of this step is not assumed to be part of the complexity of computing good proofs. Our complexity results investigate the problem along the following orthogonal dimensions: we distinguish between (i) polynomial and exponential derivers; and (ii) whether the threshold value *q* is encoded in unary or binary.
The obtained complexity upper bounds hold for all instances of a considered setting, whereas the lower bounds mean that there is an instance (usually based on EL or ELI) for which this lower bound can be proved.

In our first work in this direction [2], we focused our attention on *size* as the measure of proof quality. We showed that the above decision problem is NP-complete even for polynomial derivers and unary coding of numbers. For exponential derivers, the complexity depends on the coding of numbers: NP-complete (NExpTime-complete) for unary (binary) coding. For the related measure *tree size* (which assumes that the proof hypergraphs are tree-shaped, i.e., already derived consequences cannot be reused), the complexity turned out to

<sup>1</sup> The highly efficient reasoner ELK [21] for (an extension of) EL actually produces a derivation structure, and thus is a deriver in our sense.


**Table 1.** Overview over existing and new complexity results for deciding the existence of good proofs, w.r.t. polynomial/exponential derivers and unary/binary encoding of the bound *q* (known results in gray).

be considerably lower, due to the fact that a Dijkstra-like greedy algorithm can be applied. In [3], we generalized these results by introducing a class of measures called *Ψ-measures*, which contains both size and tree size and for which the same complexity upper bounds as for size could be shown for polynomial derivers. We also lifted the better upper bounds for tree size (for polynomial derivers) to *local Ψ-measures*, a natural class of proof measures. In this paper, we extend this line of research by providing a more general notion of measures, *monotone recursive* Φ*-measures*, which now also allows measuring the *depth* of a proof. We consider depth an important measure since it captures how much of the proof tree a (human or automated) proof checker needs to keep in memory at the same time. We analyze these measures not only for polynomial derivers, but this time also for exponential derivers, thus giving insights into how our complexity results transfer to more expressive logics. In addition to upper bounds for the general class of monotone recursive Φ-measures, we show improved bounds for the specific measures of depth and tree size, in the latter case improving results from [2]. Overall, we thus obtain a comprehensive picture of the complexity landscape for the problem of finding good proofs for DL and other entailments (see Table 1).

An extended version of this paper with detailed proofs can be found at [4].

#### **2 Preliminaries**

Most of our theoretical discussion applies to arbitrary *logics* L = (S_L, |=_L) that consist of a set S_L of L*-sentences* and a *consequence relation* |=_L ⊆ P(S_L) × S_L between L*-theories*, i.e., subsets of L-sentences, and single L-sentences. We assume that |=_L has a semantic definition, i.e., for some definition of "model", T |=_L *η* holds iff every model of all elements of T is also a model of *η*. We also assume that the *size* |*η*| of an L-sentence *η* is defined in some way, e.g., by the number of symbols in *η*. Since L is usually fixed, we drop the prefix "L-" from now on. For example, L could be *first-order logic*. However, we are mainly interested in proofs for DLs, which can be seen as decidable fragments of first-order logic [9]. In particular, we use specific DLs to show our hardness results.

$$\begin{array}{llll}
\mathsf{R}_0\ \dfrac{}{C \sqsubseteq C} &
\mathsf{R}_\top\ \dfrac{}{C \sqsubseteq \top} &
\mathsf{R}_\sqsubseteq\ \dfrac{C \sqsubseteq D}{C \sqsubseteq E} : D \sqsubseteq E \in \mathcal{T} &
\mathsf{R}_{\sqcap,1}\ \dfrac{C \sqsubseteq D \sqcap E}{C \sqsubseteq D} \\[2ex]
\mathsf{R}_{\sqcap,2}\ \dfrac{C \sqsubseteq D \sqcap E}{C \sqsubseteq E} &
\mathsf{R}_\sqcap\ \dfrac{C \sqsubseteq D \quad C \sqsubseteq E}{C \sqsubseteq D \sqcap E} &
\mathsf{R}_\exists\ \dfrac{C \sqsubseteq \exists r.D \quad D \sqsubseteq E}{C \sqsubseteq \exists r.E} &
\end{array}$$

**Fig. 1.** The inference rules for EL used in Elk [21].

The syntax of DLs is based on disjoint, countably infinite sets N_C and N_R of *concept names A, B, ...* and *role names r, s, ...*, respectively. Sentences of the DL EL, called *general concept inclusions (GCIs)*, are of the form C ⊑ D, where C and D are EL*-concepts*, which are built from concept names by applying the constructors ⊤ (*top*), C ⊓ D (*conjunction*), and ∃r.C (*existential restriction* for a role name r). The DL ELI extends EL by the role constructor r⁻ (*inverse role*). In DLs, finite theories are called *TBoxes* or *ontologies*.

The semantics of DLs is based on first-order interpretations; for details, see [9]. In Figure 1, we depict a simplified version of the inference rules for EL from [21]. For example, {A ⊑ ∃r.B, B ⊑ C, ∃r.C ⊑ D} |= A ⊑ D is a valid inference in EL. Deciding consequences in EL is P-complete [7], and in ELI it is ExpTime-complete [8].

#### **2.1 Proofs**

We formalize proofs as (labeled, directed) *hypergraphs* (see Figures 2, 3), which are tuples (V, E, ℓ) consisting of a finite set V of *vertices*, a finite set E of *(hyper)edges* of the form (S, d) with S ⊆ V and d ∈ V, and a *vertex labeling function* ℓ: V → S_L. Full definitions of such hypergraphs, as well as of related notions such as *trees*, *unravelings*, *homomorphisms*, and *cycles*, can be found in the extended version [4]. For example, there is a homomorphism from Figure 3 to Figure 2, but not vice versa, and Figure 3 is the tree unraveling of Figure 2.

**Fig. 2.** An acyclic hypergraph/proof

**Fig. 3.** A tree hypergraph/proof

The following definition formalizes basic requirements for hyperedges to be considered valid inference steps from a given finite theory.

**Definition 1 (Derivation Structure).** *A* derivation structure D = (V, E, ℓ) *over a finite theory* T *is a hypergraph that is*

**–** grounded*, i.e. every leaf v in* D *is labeled by* ℓ(v) ∈ T*; and*

**–** sound*, i.e. for every* (S, d) ∈ E*, the entailment* {ℓ(s) | s ∈ S} |= ℓ(d) *holds.*

We define proofs as special derivation structures that derive a conclusion.

**Definition 2 (Proof).** *Given a conclusion η and a finite theory* T*, a* proof for T |= η *is a derivation structure* P = (V, E, ℓ) *over* T *such that*

**–** P *is acyclic and has exactly one* sink *vertex v_η, i.e. a vertex without outgoing hyperedges, which satisfies* ℓ(v_η) = η*; and*

**–** *every vertex in* P *has at most one incoming hyperedge.*
*A* tree proof *is a proof that is a tree. A* subproof *S of a hypergraph H is a subgraph of H that is a proof s.t. the leaves of S are a subset of the leaves of H.*

The hypergraphs in Figures 2 and 3 can be seen as proofs in the sense of Definition 2, where the sentences of the theory are marked with a thick border. Both proofs use the same inference steps, but have different numbers of vertices. They both prove A ⊑ B ⊓ ∃r.A from T = {A ⊑ B, B ⊑ ∃r.A}. The second proof is a tree and the first one a hypergraph without label repetition.
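The sharing that distinguishes the two representations can be made concrete. The following Python sketch (vertex numbers and the encoding are ours, chosen to match the textual description of the two proofs) stores the hypergraph proof and computes the size of its tree unraveling by counting a shared vertex once per use:

```python
# Proof hypergraph (V, E, l) in a toy encoding of ours: vertex labels,
# plus hyperedges (premise set S, conclusion d).
# Theory sentences: A ⊑ B and B ⊑ ∃r.A; conclusion: A ⊑ B ⊓ ∃r.A.
labels = {
    1: "A ⊑ B",          # theory sentence (thick border)
    2: "B ⊑ ∃r.A",       # theory sentence
    3: "A ⊑ ∃r.A",       # inferred from 1 and 2
    4: "A ⊑ B ⊓ ∃r.A",   # inferred from 1 and 3; vertex 1 is shared
}
edges = [({1, 2}, 3), ({1, 3}, 4)]

def premises(v):
    """Premise set of the unique incoming hyperedge of v (None for leaves)."""
    for S, d in edges:
        if d == v:
            return S
    return None

def unraveling_size(v):
    """Number of vertices in the tree unraveling rooted at v: a shared
    vertex is duplicated once for every hyperedge that uses it."""
    S = premises(v)
    if S is None:
        return 1
    return 1 + sum(unraveling_size(w) for w in S)

print(len(labels))          # size of the hypergraph proof: 4
print(unraveling_size(4))   # size of its tree unraveling: 5
```

Vertex 1 (A ⊑ B) feeds both hyperedges, so the unraveling has one more vertex than the hypergraph.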

**Lemma 3.** *Let* P = (V, E, ℓ) *be a proof for* T |= η*. Then*

*1. all paths in* <sup>P</sup> *are finite and all longest paths in* <sup>P</sup> *have <sup>v</sup><sup>η</sup> as the target; and 2.* T |<sup>=</sup> *<sup>η</sup>.*

Given a proof P = (V, E, ℓ) and a vertex v ∈ V, the *subproof of* P *with sink v* is the largest subgraph P_v = (V_v, E_v, ℓ_v) of P where V_v contains all vertices in V that have a path to v in P.

#### **2.2 Derivers**

In practice, proofs and derivation structures are constructed by a reasoning system, and in theoretical investigations, it is common to define proofs by means of a calculus. To abstract from these details, we use the concept of a *deriver* as in [2], which is a function that, given a theory T and a conclusion η, produces the corresponding derivation structure in which we can look for an optimal proof. However, in practice, it would be inefficient and unnecessary to compute the entire derivation structure beforehand when looking for an optimal proof. Instead, we allow access to elements of a derivation structure through an oracle, which we can ask whether a given inference is part of the current derivation structure. Similar functionality exists, for example, for the DL reasoner Elk [19], and may correspond to checking whether the inference is an instance of a rule in the calculus. Since reasoners may not be complete for proving arbitrary sentences of L, we restrict the conclusion η to a subset C_L ⊆ S_L of supported consequences.

**Definition 4 (Deriver).** *A* deriver D *is given by a set* C_L ⊆ S_L *and a function that assigns derivation structures to pairs* (T, η) *of finite theories* T ⊆ S_L *and sentences* η ∈ C_L*, such that* T |= η *iff* D(T, η) *contains a proof for* T |= η*. A*

$$\begin{array}{ll}
\mathsf{CR1}\ \dfrac{}{K \sqsubseteq A}\ \text{if } A \in K \text{ and } K \text{ appears in } \mathcal{T}' &
\mathsf{CR2}\ \dfrac{M \sqsubseteq A \text{ for all } A \in K \quad K \sqsubseteq C}{M \sqsubseteq C}\ \text{if } M \text{ appears in } \mathcal{T}' \\[2ex]
\mathsf{CR3}\ \dfrac{M \sqsubseteq \exists r.L \quad L \sqsubseteq \forall r^{-}.A}{M \sqsubseteq A} &
\mathsf{CR4}\ \dfrac{L \sqsubseteq \exists r.M \quad L \sqsubseteq \forall r.A}{L \sqsubseteq \exists r.(M \sqcap A)}
\end{array}$$

**Fig. 4.** The inference rules for ELI [9]. Given a finite theory T in a certain normal form, the rules produce a saturated theory T'. Here, K, L, M are conjunctions of concept names, A is a concept name, C is an ELI concept of the form A, ∃r.M, or ∀r.A, and r is a role name or the inverse of a role name. In this calculus, conjunctions are implicitly viewed as sets, i.e. the order and multiplicity of conjuncts are ignored.

proof P *for* T |= η *is called* admissible w.r.t. D(T, η) *if there is a homomorphism* h: P → D(T, η)*. We call* D *a* polynomial deriver *if there exists a polynomial* p(x) *such that the size of* D(T, η) *is bounded by* p(|T| + |η|)*.* Exponential derivers *are defined similarly by the restriction* |D(T, η)| ≤ 2^{p(|T| + |η|)}*.*

Elk is an example of a polynomial deriver: for a given EL theory T and EL sentence η, Elk(T, η) contains all allowed instances of the rules shown in Figure 1. As an example of an exponential deriver we use Eli, which uses the rules from Figure 4 and is complete for ELI theories and conclusions of the form A ⊑ B, A, B ∈ N_C. The oracle access for a deriver D works as follows. Let D = (V, E, ℓ) := D(T, η) and V = {v₁, ..., v_m}. D is accessed using the following two functions, where i, i₁, ..., i_l are indices of vertices and α is a sentence:

$$\begin{aligned} [\mathcal{D}](i\_1, \ldots, i\_l, i) &:= \begin{cases} \mathtt{true} & \text{if } (\{v\_{i\_1}, \ldots, v\_{i\_l}\}, v\_i) \in E, \\ \mathtt{false} & \text{otherwise}; \end{cases} \\ [\mathcal{D}](i, \alpha) &:= \begin{cases} \mathtt{true} & \text{if } \ell(v\_i) = \alpha, \\ \mathtt{false} & \text{otherwise}. \end{cases} \end{aligned}$$

In this paper, we focus on polynomial and exponential derivers, for which we further make the following technical assumptions: 1) <sup>D</sup>(<sup>T</sup> *, η*) does not contain two vertices with the same label; 2) the number of premises in an inference is polynomially bounded by |T | and <sup>|</sup>*η*|; and 3) the size of each label is polynomially bounded by |T | and <sup>|</sup>*η*|. While 1) is without loss of generality, 2) and 3) are not. If a deriver does not satisfy 2), we may be able to fix this by splitting inference steps. Assumption 3) would not work for derivers with higher complexity, but is required in our setting to avoid trivial complexity results for exponential derivers. We furthermore assume that for polynomial and exponential derivers, the polynomial *p* from Definition 4 bounding the size of derivation structures is known.
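The oracle access just described can be pictured as a thin boolean interface over the edge and label sets. The sketch below (class and method names are ours; a real deriver would answer the queries from its calculus rather than from stored sets) exposes exactly the two functions [D](i₁, ..., i_l, i) and [D](i, α):

```python
# A toy oracle over a derivation structure: the structure is never handed
# out as a whole, only the two membership queries are answered.
class DerivationOracle:
    def __init__(self, labels, edges):
        self._labels = labels                        # vertex index -> sentence
        self._edges = {(frozenset(s), d) for s, d in edges}

    def is_edge(self, sources, target):
        """[D](i_1, ..., i_l, i): is ({v_i1, ..., v_il}, v_i) a hyperedge?"""
        return (frozenset(sources), target) in self._edges

    def has_label(self, i, alpha):
        """[D](i, alpha): is vertex v_i labeled with alpha?"""
        return self._labels.get(i) == alpha

oracle = DerivationOracle({1: "A ⊑ B", 2: "B ⊑ C", 3: "A ⊑ C"},
                          [({1, 2}, 3)])
print(oracle.is_edge({1, 2}, 3))     # True
print(oracle.has_label(3, "A ⊑ C"))  # True
```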

## **3 Measuring Proofs**

To formally study quality measures for proofs, we developed the following definition, which will be instantiated with concrete measures later. Our goal is to find proofs that minimize these measures, i.e. lower numbers are better.

**Definition 5 (**Φ**-Measure).** *<sup>A</sup>* (quality) measure *is a function* <sup>m</sup>: <sup>P</sup><sup>L</sup> <sup>→</sup> <sup>Q</sup>≥0*, where* <sup>P</sup><sup>L</sup> *is the set of all proofs over* <sup>L</sup> *and* <sup>Q</sup>≥<sup>0</sup> *is the set of non-negative rational numbers. We call* m *a* Φ-measure *if, for every* P ∈ PL*, the following hold.*

*1.* m(P') ≤ m(P) *for every subproof* P' *of* P*; and*

*2. if* h *is a homomorphism from* P *into some proof, then* m(h(P)) ≤ m(P)*.*
Intuitively, a Φ-measure m does not increase when the proof gets smaller, either when parts of the proof are removed (to obtain a subproof) or when parts are merged (in a homomorphic image). For example, m_size((V, E, ℓ)) := |V| is a Φ-measure, called the *size* of a proof, and we have already investigated the complexity of the following decision problem for m_size in [2].

**Definition 6 (Optimal Proof).** *Let* D *be a deriver and* m *be a measure. Given a finite theory* <sup>T</sup> *and a sentence <sup>η</sup>* <sup>∈</sup> *<sup>C</sup>*<sup>L</sup> *s.t.* T |<sup>=</sup> *<sup>η</sup>, an admissible proof* <sup>P</sup> *w.r.t.* <sup>D</sup>(<sup>T</sup> *, η*) *is called* optimal *w.r.t.* <sup>m</sup> *if* <sup>m</sup>(P) *is minimal among all such proofs. The associated decision problem, denoted* OP(D*,* <sup>m</sup>)*, is to decide, given* <sup>T</sup> *and <sup>η</sup> as above and <sup>q</sup>* <sup>∈</sup> <sup>Q</sup>≥<sup>0</sup>*, whether there is an admissible proof* <sup>P</sup> *w.r.t.* <sup>D</sup>(<sup>T</sup> *, η*) *with* <sup>m</sup>(P) <sup>≤</sup> *<sup>q</sup>.*

For our complexity analysis, we distinguish the encoding of *q* with a subscript (unary/binary), e.g. OPunary(D*,* m).

We first show that if <sup>P</sup> is optimal w.r.t. a <sup>Φ</sup>-measure <sup>m</sup> and <sup>D</sup>(<sup>T</sup> *, η*), then the homomorphic image of <sup>P</sup> in <sup>D</sup>(<sup>T</sup> *, η*) is also a proof. Thus, to decide OP(D*,* <sup>m</sup>) we can restrict our search to proofs that are subgraphs of <sup>D</sup>(<sup>T</sup> *, η*).

**Lemma 7.** *For any deriver* D *and* Φ*-measure* m*, if there is an admissible proof* <sup>P</sup> *w.r.t.* <sup>D</sup>(<sup>T</sup> *, η*) *with* <sup>m</sup>(P) <sup>≤</sup> *<sup>q</sup> for some <sup>q</sup>* <sup>∈</sup> <sup>Q</sup>≥<sup>0</sup>*, then there exists a subproof* <sup>Q</sup> *of* <sup>D</sup>(<sup>T</sup> *, η*) *for* T |<sup>=</sup> *<sup>η</sup> with* <sup>m</sup>(Q) <sup>≤</sup> *<sup>q</sup>.*

In particular, this shows that an optimal proof always exists.

**Corollary 8.** *For any deriver* <sup>D</sup> *and* <sup>Φ</sup>*-measure* <sup>m</sup>*, if* T |<sup>=</sup> *<sup>η</sup>, then there is an optimal proof for* T |<sup>=</sup> *<sup>η</sup> w.r.t.* <sup>D</sup> *and* <sup>m</sup>*.*

*Proof.* By Definition 4, the derivation structure <sup>D</sup>(<sup>T</sup> *, η*) contains at least one proof for T |<sup>=</sup> *<sup>η</sup>*. Since <sup>D</sup>(<sup>T</sup> *, η*) is finite, there are finitely many proofs for T |<sup>=</sup> *<sup>η</sup>* contained in <sup>D</sup>(<sup>T</sup> *, η*). The finite set of all <sup>m</sup>-weights of these proofs always has a minimum. Finally, if there were an admissible proof weighing less than this minimum, it would contradict Lemma 7.

#### **3.1 Monotone Recursive Measures**

Since the complexity of OP(D*,* m) for Φ-measures in general is quite high [2], in this paper we focus on a subclass of measures that can be evaluated recursively.

**Definition 9.** *A* Φ*-measure* m *is* recursive *if there exist*

**–** *a function* leaf_m: S_L → Q≥0*, and*

**–** *a partial function* edge_m *that maps pairs* ((S, α), Q)*, consisting of an inference* (S, α) *with* S ⊆ S_L *and* α ∈ S_L *and a multiset* Q *over* Q≥0*, to values in* Q≥0*,*
*such that, for any proof* <sup>P</sup> = (*V,E,* ) *with sink <sup>v</sup>, we have*

$$\mathfrak{m}(\mathcal{P}) = \begin{cases} \mathsf{leaf}\_{\mathfrak{m}}(\ell(v)) & \text{if } V = \{v\}, \\ \mathsf{edge}\_{\mathfrak{m}}\left(\ell(S, v), \{\mathfrak{m}(\mathcal{P}\_{w}) \mid w \in S\}\right) & \text{if } (S, v) \in E. \end{cases}$$

*Such a measure is* monotone *if, for any multiset* Q*, whenever* q ∈ Q *and* Q' = (Q \ {q}) ∪ {q'} *with* q' ≤ q *and both* edge_m((S, α), Q) *and* edge_m((S, α), Q') *are defined, then* edge_m((S, α), Q') ≤ edge_m((S, α), Q)*.*

Intuitively, a recursive measure m can be computed in a bottom-up fashion, starting with the weights of the leaves given by leaf_m. The function edge_m is used to recursively combine the weights of the direct subproofs into a weight for the full proof. This function is well-defined since in a proof every vertex has at most one incoming edge. We require edge_m to be defined only for inputs ((S, α), Q) that actually correspond to a valid proof in L, i.e. where S |=_L α and Q consists of the weights of some proofs for the sentences in S. For example, if m always yields natural numbers, we obviously do not need edge_m to be defined for multisets containing fractional numbers.

In this paper, we are particularly interested in the following monotone recursive Φ-measures.

**–** The *depth* mdepth of a proof is defined by

$$\mathsf{leaf}_{\mathsf{m}_{\mathsf{depth}}}(\alpha) := 0 \quad\text{and}\quad \mathsf{edge}_{\mathsf{m}_{\mathsf{depth}}}((\mathcal{S}, \alpha), \mathcal{Q}) := 1 + \max \mathcal{Q}.$$

**–** The *tree size* mtree is given by

$$\mathsf{leaf}_{\mathsf{m}_{\mathsf{tree}}}(\alpha) := 1 \quad\text{and}\quad \mathsf{edge}_{\mathsf{m}_{\mathsf{tree}}}((\mathcal{S}, \alpha), \mathcal{Q}) := 1 + \sum \mathcal{Q}.$$

What distinguishes *tree size* from *size* is that vertices are counted multiple times if they are used in several subproofs. The name *tree size* is inspired by the fact that it can be interpreted as the *size* of the tree unraveling of a given proof (cf. Figures 2 and 3). In fact, we show in the extended version [4] that all recursive Φ-measures are invariant under unraveling. This indicates that *tree size*, *depth* and other monotone recursive Φ-measures are especially well-suited for cases where proofs are presented to users in the form of trees. This is for example the case for the proof plugin for Protégé [20].
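Both measures can be evaluated with one generic bottom-up routine, instantiated by the pair (leaf_m, edge_m). A minimal Python sketch (our own tuple encoding of tree proofs, purely for illustration):

```python
# A recursive Φ-measure is a pair (leaf_m, edge_m); a tree proof is a
# nested tuple (label, [subproofs]). evaluate() is the bottom-up recursion
# from Definition 9, restricted to tree proofs.
def evaluate(measure, proof):
    leaf_m, edge_m = measure
    label, subs = proof
    if not subs:
        return leaf_m(label)
    return edge_m(label, [evaluate(measure, s) for s in subs])

depth = (lambda a: 0, lambda a, Q: 1 + max(Q))      # m_depth
tree_size = (lambda a: 1, lambda a, Q: 1 + sum(Q))  # m_tree

# Tree proof of A ⊑ B ⊓ ∃r.A from {A ⊑ B, B ⊑ ∃r.A}, with the
# leaf A ⊑ B written out twice, as in a tree unraveling.
leaf_ab = ("A ⊑ B", [])
proof = ("A ⊑ B ⊓ ∃r.A",
         [leaf_ab, ("A ⊑ ∃r.A", [leaf_ab, ("B ⊑ ∃r.A", [])])])

print(evaluate(depth, proof))      # 2
print(evaluate(tree_size, proof))  # 5
```

On a tree proof shaped like the one described in Section 2.1, this yields depth 2 and tree size 5, matching the recursive definitions above.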

**Lemma 10.** Depth *and* tree size *are monotone recursive* Φ*-measures.*

#### **Algorithm 1:** A Dijkstra-like algorithm

**Input:** A derivation structure D(T, η) = (V, E, ℓ) and a monotone recursive Φ-measure m
**Output:** An optimal proof for T |= η w.r.t. D(T, η) and m

 1  Q := ∅
 2  foreach e ∈ E do
 3      k(e) := 0
 4  foreach v ∈ V do
 5      if ℓ(v) ∈ T then
 6          P(v) := ({v}, ∅, ℓ|_{v}); Q := Q ∪ {v}    // ℓ(v) is in the theory
 7      else if (∅, v) ∈ E then P(v) := ({v}, {(∅, v)}, ℓ|_{v}); Q := Q ∪ {v}    // ℓ(v) is a tautology
 8      else P(v) := undefined
 9  while Q ≠ ∅ do
10      choose v ∈ Q with minimal m(P(v))    // P(v) is optimal for ℓ(v)
11      Q := Q \ {v}
12      foreach e = (S, d) ∈ E with v ∈ S do
13          k(e) := k(e) + 1
14          if k(e) = |S| then    // all source vertices have been reached
15              P := (S ∪ {d}, {e}, ℓ|_{S ∪ {d}}) ∪ ⋃_{s ∈ S} P(s)    // construct new proof
16              if P is acyclic then
17                  if P(d) is undefined or m(P(d)) > m(P) then
18                      P(d) := P
19                      Q := Q ∪ {d}    // P is better for ℓ(d)
20  return P(v_η), where ℓ(v_η) = η

#### **4 Complexity Results**

We investigate the decision problem OP for monotone recursive Φ-measures. We first show upper bounds for the general case, and then consider measures for *depth* and *tree size*, for which we obtain even lower bounds. An artificial modification of the *depth* measure gives a lower bound matching the general upper bound even if unary encoding is used for the threshold *q*.

#### **4.1 The General Case**

Algorithm 1 describes a Dijkstra-like approach that is inspired by the algorithm in [13] for finding minimal hyperpaths w.r.t. so-called *additive weighting functions*, which represent a subclass of monotone recursive Φ-measures. The algorithm progressively discovers proofs P(v) for ℓ(v) that are contained in D(T, η). If it reaches a new vertex v in this process, this vertex is added to the set Q. In each step, a vertex with minimal weight m(P(v)) is chosen and removed from Q. For each hyperedge e = (S, d) ∈ E, a counter k(e) is maintained that is increased whenever a vertex v ∈ S is chosen. Once this counter reaches |S|, we know that all source vertices of e have been processed. The algorithm then constructs a new proof P for ℓ(d) by joining the proofs for the source vertices using the

current hyperedge e. This proof P is then compared to the best previously known proof P(d) for ℓ(d), and P(d) is updated accordingly. For Line 20, recall that we assumed D(T, η) to contain no two vertices with the same label, and hence it contains a unique vertex v_η with label η.
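For a fixed measure, the core of Algorithm 1 fits in a few lines. The Python sketch below (a simplification under our own encoding: it tracks only the weights m(P(v)) rather than the proofs themselves, fixes the measure to tree size, and drops the explicit acyclicity test, which for tree size never rejects a strictly better candidate) mirrors the counters k(e) and the extraction of a minimal-weight vertex:

```python
import heapq

def optimal_weight(axioms, edges, goal):
    """Minimal tree size of a proof of `goal`, or None if unprovable.
    `edges` is a list of hyperedges (frozenset of premises, conclusion)."""
    weight = {v: 1 for v in axioms}            # leaf_m := 1 for theory sentences
    count = [0] * len(edges)                   # k(e): premises already finalized
    heap = [(1, v) for v in axioms]
    heapq.heapify(heap)
    done = set()
    while heap:
        w, v = heapq.heappop(heap)
        if v in done:
            continue                           # stale queue entry
        done.add(v)                            # w = m(P(v)) is now optimal
        for i, (S, d) in enumerate(edges):
            if v not in S:
                continue
            count[i] += 1
            if count[i] == len(S):             # all source vertices reached
                new = 1 + sum(weight[s] for s in S)   # edge_m := 1 + sum
                if new < weight.get(d, float("inf")):
                    weight[d] = new
                    heapq.heappush(heap, (new, d))
    return weight.get(goal)

edges = [(frozenset({"A ⊑ B", "B ⊑ C"}), "A ⊑ C")]
print(optimal_weight({"A ⊑ B", "B ⊑ C"}, edges, "A ⊑ C"))  # 3
```

Because the measure is monotone, a vertex popped with minimal weight is final, exactly the invariant behind Lemma 11.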

**Lemma 11.** *For any monotone recursive* Φ*-measure* m *and deriver* D*, Algorithm <sup>1</sup> computes an optimal proof in time polynomial in the size of* <sup>D</sup>(<sup>T</sup> *, η*)*.*

Since we can actually compute an optimal proof in polynomial time in the size of the whole derivation structure, it is irrelevant how the upper bound *q* in the decision problem OP is encoded, and hence the following results follow.

**Theorem 12.** *For any monotone recursive* Φ*-measure* m *and polynomial deriver* D*,* OPbinary(D*,* m) *is in* P*. It is in* ExpTime *for all exponential derivers* D*.*

#### **4.2 Proof Depth**

We now consider the measure mdepth in more detail. We can show lower bounds of P and ExpTime for polynomial and exponential derivers, respectively, although the latter only holds for upper bounds *q* encoded in binary.

Since our definition of OP(D, m) requires that the input entailment T |= η already holds, we cannot use a straightforward reduction from the entailment problem in EL or ELI. Instead, we show that ordinary proofs P for T |= η satisfy m(P) ≤ q for some q, and then extend the TBox to T' in order to create an artificial proof P' with m(P') > q. In this way, we ensure that T' |= η holds, and we can use q to distinguish the artificial from the original proofs.

For ELI, we can use an observation from [9, Example 6.29] for this purpose.

**Proposition 13 ([9]).** *For every q* ∈ Q≥0 *and* ELI *sentence of the form A* ⊑ *B, where A, B* ∈ N_C*, one can construct in time polynomial in q an* ELI *theory* T *such that* T |= A ⊑ B*, and every proof for* T |= A ⊑ B *in* Eli *is of depth larger than* 2^q*.*

We can now reduce the entailment problems for EL and ELI to obtain the claimed lower bounds.

**Theorem 14.** *The problems* OPunary(Elk*,* mdepth) *and* OPbinary(Eli*,* mdepth) *are* P*-hard and* ExpTime*-hard, respectively.*

*Proof.* For the P-hardness, we provide a LogSpace-reduction from the problem of deciding entailment of a GCI A ⊑ B, for concept names A, B, from an EL theory T, which is P-hard [9]. To reduce this problem to OPunary(Elk, m_depth), we need to find a theory T' and a number q such that T' |= A ⊑ B holds, and moreover T |= A ⊑ B holds iff Elk(T', A ⊑ B) contains a proof of T' |= A ⊑ B of depth ≤ q (cf. Lemma 7).

First, observe that, since proofs must be acyclic, the depth of any proof of A ⊑ B from T is bounded by q := |Elk(T, A ⊑ B)|, whose unary encoding is of size polynomial in the size of T. We now construct

$$\mathcal{T}' := \mathcal{T} \cup \{ A \sqsubseteq A_1,\ A_1 \sqsubseteq A_2,\ \dots,\ A_q \sqsubseteq B \},$$

where A₁, ..., A_q are concept names not occurring in T. Clearly, we have T' |= A ⊑ B. Furthermore, the existence of an admissible proof for T' |= A ⊑ B of depth at most q is equivalent to T |= A ⊑ B, since any proof that uses the new concept names must take q + 1 consecutive steps using rule R⊑, i.e. must be of depth q + 1. Moreover, we can compute q (in binary representation) and output it in unary representation using a logarithmically space-bounded Turing machine, and similarly for T'. Hence, the above construction constitutes the desired LogSpace-reduction.
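The padding step of this reduction is easy to make explicit. A small Python sketch (function name and string encoding are ours) that appends the chain A ⊑ A₁ ⊑ ... ⊑ A_q ⊑ B of fresh concept names to a theory, so that any proof through the chain needs q + 1 consecutive subsumption steps:

```python
# Extend a theory (list of GCI strings) with a subsumption chain of q
# fresh concept names from `lhs` to `rhs`, as in the proof of Theorem 14.
def pad_theory(theory, q, lhs="A", rhs="B"):
    fresh = [f"{lhs}_{i}" for i in range(1, q + 1)]
    chain = list(zip([lhs] + fresh, fresh + [rhs]))   # q + 1 new GCIs
    return theory + [f"{c} ⊑ {d}" for c, d in chain]

t = pad_theory(["A ⊑ B"], 3)
print(t)  # ['A ⊑ B', 'A ⊑ A_1', 'A_1 ⊑ A_2', 'A_2 ⊑ A_3', 'A_3 ⊑ B']
```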

For the remaining result, we can use similar arguments for the exponential deriver Eli, for which entailment is ExpTime-hard [9].


To demonstrate that the generic upper bounds from Theorem 12 are tight even for unary encoding, we briefly consider the artificial measure m_log(depth) (*logarithmic depth*), which simply computes the (binary) logarithm of the depth of a given proof. This is also a monotone recursive Φ-measure, since the logarithmic depth contains exactly the same information as the depth itself. It is easy to obtain the following lower bounds from the previous results about m_depth.

**Corollary 15.** OPunary(Elk*,* mlog(depth)) *is* P*-hard and* OPunary(Eli*,* mlog(depth)) *is* ExpTime*-hard.*

*Proof.* For any deriver D, OPbinary(D*,* mdepth) can be LogSpace-reduced to OPunary(D*,* mlog(depth)), because in order to find a proof of depth at most *q* (with *q* given in binary), one can equivalently look for a proof whose logarithmic depth is bounded by the value log *q*. The unary encoding of log *q* has the same size as the binary encoding of *q* and can be computed in LogSpace by flipping all bits of the binary encoding of *<sup>q</sup>* to <sup>1</sup>.

We now return to mdepth and cover the remaining case of exponential derivers and unary encoding of the upper bound *q*.

**Theorem 16.** OPunary(D*,* mdepth) *is in* PSpace *for any exponential deriver* D*. It is* PSpace*-hard for the exponential deriver* D = Eli*.*

*Proof.* For the upper bound, we employ a depth-first guessing strategy: we guess a proof of depth at most *q*, where at each time point we only keep one branch of the proof in memory. As the length of this branch is bounded by *q*, and due to our assumptions on derivers, this procedure only requires polynomial space.

For the lower bound, we provide a reduction from the PSpace-complete QBF problem (satisfiability of quantified Boolean formulas). Let Q1*x*1Q2*x*<sup>2</sup> *...* Q*mxm.φ* be a quantified Boolean formula, where for *<sup>i</sup>* ∈ {1*,...,m*}, <sup>Q</sup>*<sup>i</sup>* ∈ {∃*,* ∀}, and *<sup>φ</sup>* is

a formula over {x₁, ..., x_m}. We assume φ to be in negation normal form, that is, negation occurs only directly in front of a variable. We construct an ELI theory T and a number q, both of size polynomial in the size of the formula, such that T |= A ⊑ B holds (cf. Definition 6) and T has a proof for A ⊑ B of depth at most q iff the QBF is valid. We use two roles r₁, r₂ to deal with the variable valuations, concept names A₀, ..., A_m to count the quantifier nesting, and a concept name A_ψ for every sub-formula ψ of φ. In addition, we use the concept names A and B occurring in the conclusion, and two concept names B₁ and B₂.

The concept name *A* initializes the formula at quantifier nesting level 0:

$$A \sqsubseteq A_0$$

For every i ∈ {1, ..., m}, T contains the following sentences to select a truth valuation for x_i, increasing the nesting depth in each step.

$$A_{i-1} \sqsubseteq \exists r_1. (A_i \sqcap A_{x_i}) \tag{1}$$

$$A\_{i-1} \sqsubseteq \exists r\_2. (A\_i \sqcap A\_{\neg x\_i}). \tag{2}$$

To ensure truth valuations are kept along the role-successors, we use the following sentences for every *<sup>l</sup>* ∈ {*xi,* <sup>¬</sup>*x<sup>i</sup>* <sup>|</sup> <sup>1</sup> <sup>≤</sup> *<sup>i</sup>* <sup>≤</sup> *<sup>m</sup>*}:

$$A\_l \sqsubseteq \forall r\_1. A\_l \qquad A\_l \sqsubseteq \forall r\_2. A\_l \tag{3}$$

The following GCIs are now used to evaluate *<sup>φ</sup>*. For every conjunction *<sup>ψ</sup>* <sup>=</sup> *<sup>ψ</sup>*1∧*ψ*<sup>2</sup> occurring in *φ*, we use:

$$A\_{\psi\_1} \sqcap A\_{\psi\_2} \sqsubseteq A\_{\psi},\tag{4}$$

and for every disjunction *<sup>ψ</sup>* <sup>=</sup> *<sup>ψ</sup>*<sup>1</sup> <sup>∨</sup> *<sup>ψ</sup>*2, we use:

$$A\_{\psi\_1} \sqsubseteq A\_{\psi} \qquad A\_{\psi\_2} \sqsubseteq A\_{\psi} \tag{5}$$

Finally, the following GCIs are used to propagate the result of the evaluation back towards the start.

$$A_{\phi} \sqsubseteq B \tag{6}$$

$$A_i \sqcap B \sqsubseteq \forall r_1^-. B \qquad \qquad A_i \sqcap B \sqsubseteq \forall r_2^-. B \qquad \quad \text{if } \mathbb{Q}_i = \exists \tag{7}$$

$$A\_i \sqcap B \sqsubseteq \forall r\_1^-. B\_1 \qquad \quad A\_i \sqcap B \sqsubseteq \forall r\_2^-. B\_2 \qquad \quad B\_1 \sqcap B\_2 \sqsubseteq B \qquad \quad \text{if } \mathbb{Q}\_i = \forall \tag{8}$$

One can now show that there exists a proof for A ⊑ B from T of depth at most q iff the QBF formula is valid, where q is polynomial and determined by the size and structure of φ. Finally, we can extend T with the sentences from Proposition 13 to ensure that T |= A ⊑ B holds while retaining this equivalence.
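The construction can be spelled out as a small generator. The Python sketch below (our own string encoding of GCIs and of the QBF; the output is not in the normal form required by Eli) emits the sentences (1)–(8) for a given quantifier prefix and NNF matrix:

```python
# The QBF is given by a prefix [('E', 'x1'), ('A', 'x2'), ...] and an NNF
# formula built from ('var', x), ('neg', x), ('and', f, g), ('or', f, g).
def qbf_theory(prefix, phi):
    m = len(prefix)
    gcis = ["A ⊑ A_0"]                                   # initialization
    for i in range(1, m + 1):                            # (1) and (2)
        x = prefix[i - 1][1]
        gcis.append(f"A_{i-1} ⊑ ∃r1.(A_{i} ⊓ A[{x}])")
        gcis.append(f"A_{i-1} ⊑ ∃r2.(A_{i} ⊓ A[¬{x}])")
    for _, x in prefix:                                  # (3): keep valuations
        for lit in (x, "¬" + x):
            gcis.append(f"A[{lit}] ⊑ ∀r1.A[{lit}]")
            gcis.append(f"A[{lit}] ⊑ ∀r2.A[{lit}]")

    def walk(f):                                         # (4) and (5)
        if f[0] == "var":
            return f"A[{f[1]}]"
        if f[0] == "neg":
            return f"A[¬{f[1]}]"
        a, b = walk(f[1]), walk(f[2])
        name = f"A[{f!r}]"                               # one name per subformula
        if f[0] == "and":
            gcis.append(f"{a} ⊓ {b} ⊑ {name}")
        else:
            gcis.append(f"{a} ⊑ {name}")
            gcis.append(f"{b} ⊑ {name}")
        return name

    gcis.append(f"{walk(phi)} ⊑ B")                      # (6)
    for i, (q, _) in enumerate(prefix, start=1):         # (7) and (8)
        if q == "E":
            gcis.append(f"A_{i} ⊓ B ⊑ ∀r1⁻.B")
            gcis.append(f"A_{i} ⊓ B ⊑ ∀r2⁻.B")
        else:
            gcis.append(f"A_{i} ⊓ B ⊑ ∀r1⁻.B1")
            gcis.append(f"A_{i} ⊓ B ⊑ ∀r2⁻.B2")
            gcis.append("B1 ⊓ B2 ⊑ B")
    return gcis

t = qbf_theory([("E", "x1")], ("var", "x1"))
print(len(t))  # 10 GCIs for the formula ∃x1. x1
```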

#### **4.3 The Tree Size Measure**

The tree size measure was already discussed in [2], where tight bounds were provided for polynomial derivers and for exponential derivers with unary encoding. For the case of exponential derivers with binary encoding, only an ExpTime upper bound was provided, and the precise complexity was left open. We improve this result by showing that OPbinary(D, m_tree) can indeed be decided in PSpace.

**Fig. 5.** Illustration of the argument used for Theorem 17. On the top, the partially guessed proof tree for two consecutive steps of the algorithm is shown, where the dark nodes are what is currently kept in memory. On the bottom, we see how the corresponding tuples are organized into a tree satisfying Conditions **S1**–**S6**.

**Theorem 17.** *For any exponential deriver* D*,* OPbinary(D*,* mtree) *is in* PSpace*.*

*Proof (sketch).* We describe a non-deterministic procedure for OPbinary(D, m_tree) that runs in polynomial space. Let T be a theory, η the goal sentence, and q a rational number in binary encoding. By Lemma 7, it suffices to find a proof P for T |= η in D(T, η) with m_tree(P) ≤ q. The procedure guesses such a proof starting from the conclusion, while keeping in memory a set S of tuples (η', q'), where η' is a sentence and q' ≤ q a rational number. Intuitively, such a tuple states: "We still need to guess a proof for η' of tree size at most q'." Starting from S := {(η, q)}, the procedure repeats the following steps until S is empty:

- (a) select from S a tuple (η', q') such that q' is minimal, i.e. for all tuples (η'', q'') ∈ S it holds that q'' ≥ q';
- (b) guess a hyperedge ({v₁, ..., v_m}, v') in D(T, η) (using the oracle access described in Section 2.2) and m numbers q₁, ..., q_m, such that ℓ(v') = η' and q₁ + ... + q_m + 1 ≤ q'; and
- (c) replace (η', q') in S by the tuples (ℓ(v₁), q₁), ..., (ℓ(v_m), q_m).

There is a proof for T |= η of tree size at most q iff there is a run of the procedure in which every step is successful. To show that it only requires polynomial space, we show that during the computation, the number of elements in S is always polynomially bounded. For this, we show that the elements in S can always be organized into a tree with the following properties:


**S6** for every node labeled (η', q') with children labeled (η₁, q₁), ..., (η_m, q_m), we have q₁ + ... + q_m < q'.

We prove this by induction on the steps of the algorithm, where in each step, we either replace one tuple in the tree, or put the new tuples under the leaf with the currently smallest value (see Fig. 5). By **S3**, and because every number in *S* is bounded by *q*, we can show that the tree has depth at most log<sub>2</sub> *q*, which together with **S4** and **S5** implies that it has at most *p* · log<sub>2</sub> *q* nodes. **S2** then implies that |*S*| ≤ *p* · log<sub>2</sub> *q* is always satisfied, and thus that *S* is polynomially bounded.
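To make the bookkeeping concrete, here is a small backtracking sketch (ours, not part of the paper): the deriver is given as an explicit list of hyperedges `(premises, conclusion)`, an empty premise list marks an axiom, and backtracking over hyperedges and budget splits plays the role of the non-deterministic guesses (a real deriver would instead be accessed through the oracle of Section 2.2).

```python
def compositions(n, m):
    """All ways to write n as an ordered sum of m positive integers."""
    if m == 0:
        if n == 0:
            yield ()
        return
    for first in range(1, n - m + 2):
        for rest in compositions(n - first, m - 1):
            yield (first,) + rest

def provable_within(hyperedges, goal, q):
    """Is there a proof of `goal` with tree size at most q?  The set S
    holds tuples (sentence, remaining budget); we always expand a tuple
    of minimal budget, as in step (a)."""
    def expand(S):
        if not S:
            return True
        eta, b = min(S, key=lambda t: t[1])          # step (a)
        rest = S - {(eta, b)}
        for premises, conclusion in hyperedges:       # step (b)
            if conclusion != eta or b < len(premises) + 1:
                continue
            if not premises:                          # axiom: tree size 1
                if expand(rest):
                    return True
            else:                                     # step (c)
                for split in compositions(b - 1, len(premises)):
                    if expand(rest | set(zip(premises, split))):
                        return True
        return False
    return expand({(goal, q)})

# Toy deriver: axioms A and B, and one inference {A, B} => C.
edges = [((), "A"), ((), "B"), (("A", "B"), "C")]
```

Here `provable_within(edges, "C", 3)` succeeds (two axiom leaves plus the conclusion node), while a budget of 2 fails; the budget strictly decreases along every branch, which bounds the recursion.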

A corresponding lower bound can be found for the exponential deriver Eli by a reduction of the word problem for deterministic Turing machines with polynomial space bound.

**Theorem 18.** *For the exponential deriver* Eli*,* OP<sub>binary</sub>(Eli, m<sub>tree</sub>) *is* PSpace*-hard.*

*Proof (sketch).* Let *T* = (*Q*, *Γ*, b̸, *Σ*, *δ*, *q*<sub>0</sub>, *F*) be a deterministic Turing machine, where *Q* is the set of states, *Γ* the tape alphabet, b̸ ∈ *Γ* the blank symbol, *Σ* ⊆ *Γ* the input alphabet, *δ* : *Q* × *Γ* → *Q* × *Γ* × {−1, 0, +1} the partial transition function, *q*<sub>0</sub> the initial state, and *F* ⊆ *Q* the set of accepting states. We assume that *T* is polynomially space bounded, that is, there is a polynomial *p* such that on input words *w* ∈ *Σ*<sup>∗</sup>, *T* only accesses the first *p*(|*w*|) cells of the tape. For a word *w*, we denote by *w*[*i*] its *i*th letter. For some fixed word *w*, we construct a theory T using the following names, where *k* = *p*(|*w*|):


For convenience, we present the theory not in the required normal form, but aggregate conjunctions on the right. The following sentence describes the initial configuration.

$$\mathsf{Start} \sqsubseteq S_{q_0} \sqcap \prod_{i=0}^{|w|-1} A_i^{w[i]} \sqcap \prod_{i=|w|}^{k} A_i^{\not b} \sqcap P_0^+ \sqcap \prod_{i=1}^{k} P_i^- \tag{9}$$

The transition from one configuration to the next is encoded with the following sentences for every *i* ∈ {0, ..., *k*} and every (*q*, *a*) ∈ *Q* × *Γ* with *δ*(*q*, *a*) = (*q′*, *b*, *d*):

$$S_q \sqcap A_i^a \sqcap P_i^+ \sqsubseteq \exists r. S_{q'} \sqcap \forall r. A_i^b \sqcap \forall r. P_{i+d}^+ \sqcap \prod_{j \in \{0, \dots, k\} \setminus \{i+d\}} \forall r. P_j^- \tag{10}$$

$$A\_i^a \sqcap P\_i^- \sqsubseteq \forall r. A\_i^a \tag{11}$$

Finally, we use the following sentences to detect accepting configurations and propagate the information of acceptance back to the initial configuration:

$$S_f \sqsubseteq \mathsf{Accept} \text{ for all } f \in F,\tag{12}$$

$$\mathsf{Accept} \sqsubseteq \forall r^-.\mathsf{Accept}\tag{13}$$

One can find a number *q*, exponential in *k* and the size of *T*, such that there is a proof for T |= Start ⊑ Accept with tree size at most *q* iff *T* accepts *w*. Using Proposition 13, we can extend T to a theory T′ such that T′ |= Start ⊑ Accept, while a proof of tree size at most *q* exists iff *T* accepts *w* (observe that m<sub>tree</sub>(P) ≥ m<sub>depth</sub>(P) holds for all proofs P).

#### **5 Conclusion**

We have investigated the complexity of finding optimal proofs w.r.t. quality measures that satisfy the property of being *monotone recursive*. Two important examples of this class of measures, *depth* and *tree size*, have been considered in detail in combination with exponential and polynomial derivers. The obtained results are promising: given a deriver, the search for an optimal proof for an entailment can be easier than producing all of the proofs by this deriver. The algorithms used to show the upper bounds can serve as building blocks for finding an optimal proof w.r.t. a monotone recursive measure automatically.

We conjecture that weighted versions of *tree size* and *depth*, where sentences or inference steps can have associated rational weights, are also monotone recursive, and the generic upper bounds established in this paper can be straightforwardly applied to them. However, a more thorough study is required here, since the complexity of the decision problem depends on the exact way in which the weights are employed. This step towards weighted measures is motivated by user studies [1, 15, 24], demonstrating that different types of sentences and logical inferences can be more or less difficult to understand.

*Acknowledgements* This work was supported by the DFG in grant 389792660 as part of TRR 248 (https://perspicuous-computing.science), and QuantLA, GRK 1763 (https://lat.inf.tu-dresden.de/quantla).

#### **References**


Kovacs, L. (eds.) LPAR-23: 23rd International Conference on Logic for Programming, Artificial Intelligence and Reasoning. EPiC Series in Computing, vol. 73, pp. 32–67. EasyChair (2020). https://doi.org/10.29007/nhpp


28. Schlobach, S., Cornet, R.: Non-standard reasoning services for the debugging of description logic terminologies. In: Gottlob, G., Walsh, T. (eds.) Proc. of the 18th Int. Joint Conf. on Artificial Intelligence (IJCAI 2003). pp. 355–362. Morgan Kaufmann, Acapulco, Mexico (2003), http://ijcai.org/Proceedings/03/Papers/053.pdf


## **Computing Optimal Repairs of Quantified ABoxes w.r.t. Static EL TBoxes**

Franz Baader, Patrick Koopmann, Francesco Kriegel, and Adrian Nuradiansyah

> Theoretical Computer Science, TU Dresden, Dresden, Germany firstname.lastname@tu-dresden.de

**Abstract.** The application of automated reasoning approaches to Description Logic (DL) ontologies may produce certain consequences that either are deemed to be wrong or should be hidden for privacy reasons. The question is then how to repair the ontology such that the unwanted consequences can no longer be deduced. An optimal repair is one where the least amount of other consequences is removed. Most of the previous approaches to ontology repair are of a syntactic nature in that they remove or weaken the axioms explicitly present in the ontology, and thus cannot achieve semantic optimality. In previous work, we have addressed the problem of computing optimal repairs of (quantified) ABoxes, where the unwanted consequences are described by concept assertions of the lightweight DL EL. In the present paper, we improve on the results achieved so far in two ways. First, we allow for the presence of terminological knowledge in the form of an EL TBox. This TBox is assumed to be static in the sense that it cannot be changed in the repair process. Second, the construction of optimal repairs described in our previous work is best case exponential. We introduce an optimized construction that is exponential only in the worst case. First experimental results indicate that this reduces the size of the computed optimal repairs considerably.

#### **1 Introduction**

Description Logics [3] are a well-investigated family of logic-based knowledge representation languages, which are frequently used to formalize ontologies for application domains such as biology and medicine [17]. As the size of ontologies grows, the likelihood of them containing errors increases as well. This is particularly problematic if the data, stored in the ABox, are automatically extracted from text or other sources using natural language processing or machine learning. The reasoning services of DL systems [22,12,33,15], which derive implicit consequences from the explicitly represented knowledge, are not only useful once an ontology is deployed, but can also be employed for debugging purposes by exhibiting consequences that are not supposed to hold in the application

funded by DFG in project number 430150274 and TRR 248 (cpec, grant 389792660).

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 309–326, 2021. https://doi.org/10.1007/978-3-030-79876-5_18

domain. Another reason why one might want to remove a consequence is that it reveals private information that is supposed to be hidden [14,5]. Once such an unwanted consequence is detected, it is often not easy to see how to repair the ontology in order to get rid of this consequence. Classical repair approaches based on axiom pinpointing [31,29,27,32,21,8] compute maximal subsets of the ontology that do not have the consequence. The obtained result thus strongly depends on the syntactic form of the axioms. For example, it is well-known that, for expressive DLs, a finite set of terminological axioms can be expressed by a single axiom. If the given terminology (TBox) is of this shape, then the only possible classical repair is the empty TBox. To alleviate this problem, repair approaches have been developed that replace certain axioms by weaker ones (in the sense that they have less consequences) instead of removing them completely [18,24,34,6]. However, these approaches usually do not produce optimal repairs. In fact, it was shown in [6] that, even for the inexpressive DL EL, optimal repairs need not exist. The abstract example given there can be rephrased as follows. Assume that the TBox defines humans to be exactly those individuals that have a human parent, and that the ABox says that Sam is a human. After we find out that Sam is in fact not human [9], we want to get rid of the latter assertion, but keep the (correct) consequences saying that Sam has an unbounded chain of ancestors (of undetermined species). If the TBox is assumed to be fixed, then there is no optimal repair of the ABox since we can add only a finite number of parent assertions.

To avoid such problems, our previous work on computing optimal repairs (formulated in the guise of achieving compliance with privacy policies) restricted the attention to the case without TBox. In [5] the ABox was additionally restricted to be a so-called instance store [19], i.e., an ABox without role assertions. The privacy policy (specifying which consequences are to be removed) was given as EL instance queries. In this setting, optimal repairs always exist and can be computed in exponential time, which is optimal since there may be exponentially many optimal repairs of exponential size.

In [7] these results were extended to ABoxes with role assertions. More precisely, we considered *quantified* ABoxes in which some individuals are anonymized by viewing them as existentially quantified variables. For example, assume that the ABox contains the information that Ben has a parent, Jerry, that is both rich and famous, and we want to remove the consequence ∃*parent*.(*Rich* ⊓ *Famous*)(*BEN*). Classical repairs can be obtained by removing one of the assertions *Rich*(*JERRY*), *Famous*(*JERRY*), and *parent*(*BEN*, *JERRY*). If instead we replace the first assertion with *Rich*(*x*) and *parent*(*BEN*, *x*) for an existentially quantified variable *x*, then we retain more consequences. Note that we could not have used an individual name (i.e., constant) *ANNE* instead of *x* since information like *Rich*(*ANNE*) about Anne does not follow from the original ABox. We show in [7] that in this setting all optimal repairs can be computed by an exponential-time algorithm with access to an NP-oracle. The oracle is needed since our algorithm first computes a superset of the set of optimal repairs, from which non-optimal ones need to be removed using the (NP-complete) entailment test between (potentially exponentially large) quantified ABoxes. We also consider a modified version of entailment (called IQ-entailment) in [7], where quantified ABoxes are compared w.r.t. which EL instance relationships they imply. Using this notion, no NP-oracle is needed for computing the set of all IQ-optimal repairs since IQ-entailment can be decided in polynomial time.

In the present paper, we improve on these results in two respects. On the one hand, we allow for the presence of terminological knowledge in the form of an EL TBox, which is assumed to be correct, and thus is not changed by the repair. To deal with a TBox, the approach from [7] for computing optimal repairs must be extended in two ways. First, the ABox needs to be saturated w.r.t. the TBox before applying our repair approach. The saturated ABox has the same consequences as the original one has together with the TBox. In our Ben and Jerry example, assume that the assertion *Rich*(*JERRY*) does not belong to the original ABox, but the TBox contains the axiom *Famous* ⊑ *Rich*. Then the ABox on its own does not have the unwanted consequence ∃*parent*.(*Rich* ⊓ *Famous*)(*BEN*), but together with the TBox it does. Saturation adds the assertion *Rich*(*JERRY*) to the ABox. For arbitrary TBoxes, saturation need not terminate. We consider two ways to remedy this problem: either allow for arbitrary TBoxes, but consider IQ-entailment, or use classical entailment, but consider cycle-restricted TBoxes [1]. In both cases, saturation always terminates; in the former in polynomial and in the latter in exponential time. One might be tempted to assume that, after saturation, one can simply apply the repair approach of [7] unchanged. This is not true, however, since the TBox may re-add assertions that have been removed or replaced by the repair. In our example, where *Rich*(*JERRY*) is replaced, but *Famous*(*JERRY*) is left untouched in the repair, the repaired ABox together with the TBox would still have the unwanted consequence. Thus, the repair approach needs to be changed to take this possibility into account.

On the other hand, the construction of optimal repairs described in our previous work [5,7], and extended in this paper such that it can deal with TBoxes, is best case exponential. The second contribution of this paper is the design of a new construction, both for classical and IQ-entailment, that is exponential only in the worst case. We also report on first experimental results, which indicate that this reduces the size of the computed optimal repairs considerably.

Detailed proofs of our results can be found in [4].

## **2 Preliminaries**

Throughout this paper, we assume that *Σ* is a *signature*, which is a disjoint union of sets *Σ*<sub>O</sub>, *Σ*<sub>C</sub>, and *Σ*<sub>R</sub> of *object names*, *concept names*, and *role names*. We use symbols *t*, *u*, *v*, *w* to denote object names, *A*, *B* to denote concept names, and *r*, *s* to denote role names, all of them possibly with sub- or superscripts.

As in [7], a *quantified ABox (qABox)* ∃*X*.A over *Σ* consists of a finite subset *X* of *Σ*<sub>O</sub>, the elements of which are called *variables*, and a *matrix* A, which is a finite set of *concept assertions A*(*u*) where *u* ∈ *Σ*<sub>O</sub> and *A* ∈ *Σ*<sub>C</sub>, and of *role assertions r*(*u*, *v*) where *u*, *v* ∈ *Σ*<sub>O</sub> and *r* ∈ *Σ*<sub>R</sub>. A non-variable object name in ∃*X*.A is called an *individual name*, and the set of all these names is denoted as *Σ*<sub>I</sub>(∃*X*.A). We further set *Σ*<sub>O</sub>(∃*X*.A) := *Σ*<sub>I</sub>(∃*X*.A) ∪ *X*. Traditional DL ABoxes are qABoxes where *X* = ∅; we then write A instead of ∃∅.A. The matrix of a qABox is such a traditional ABox.
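Concretely, a qABox is just three finite sets. The following Python sketch fixes one possible representation (the class and field names are our own, not from the paper):

```python
from typing import FrozenSet, NamedTuple, Tuple

class QABox(NamedTuple):
    """A quantified ABox ∃X.A: a set X of variables plus a matrix of
    concept assertions A(u) and role assertions r(u, v)."""
    variables: FrozenSet[str]
    concepts: FrozenSet[Tuple[str, str]]    # (A, u)
    roles: FrozenSet[Tuple[str, str, str]]  # (r, u, v)

    def objects(self):
        """Σ_O(∃X.A): all object names occurring in the qABox."""
        names = set(self.variables)
        names |= {u for _, u in self.concepts}
        for _, u, v in self.roles:
            names |= {u, v}
        return names

    def individuals(self):
        """Σ_I(∃X.A): the occurring non-variable object names."""
        return self.objects() - self.variables

# The repaired Ben/Jerry qABox from the introduction, where Rich(JERRY)
# was replaced by Rich(x) for an existentially quantified variable x:
repaired = QABox(
    variables=frozenset({"x"}),
    concepts=frozenset({("Rich", "x"), ("Famous", "JERRY")}),
    roles=frozenset({("parent", "BEN", "JERRY"), ("parent", "BEN", "x")}),
)
```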

An *interpretation* I of *Σ* is a pair (*Δ*<sup>I</sup>, ·<sup>I</sup>), where the *domain Δ*<sup>I</sup> is a non-empty set and the *interpretation function* ·<sup>I</sup> maps each *u* ∈ *Σ*<sub>O</sub> to an element *u*<sup>I</sup> of *Δ*<sup>I</sup>, each *A* ∈ *Σ*<sub>C</sub> to a set *A*<sup>I</sup> ⊆ *Δ*<sup>I</sup>, and each *r* ∈ *Σ*<sub>R</sub> to a binary relation *r*<sup>I</sup> over *Δ*<sup>I</sup>. The interpretation I of *Σ* is a *model* of a qABox ∃*X*.A over *Σ* if there is an interpretation J such that *Δ*<sup>I</sup> = *Δ*<sup>J</sup>, the interpretation functions ·<sup>I</sup> and ·<sup>J</sup> coincide on *Σ* \ *X*, and *u*<sup>J</sup> ∈ *A*<sup>J</sup> for each *A*(*u*) ∈ A as well as (*u*<sup>J</sup>, *v*<sup>J</sup>) ∈ *r*<sup>J</sup> for each *r*(*u*, *v*) ∈ A.

Following [7], we define EL atoms and EL concept descriptions over *Σ* by simultaneous induction as follows. An EL *atom* is either a concept name *A* ∈ *Σ*<sub>C</sub> or an *existential restriction* ∃*r*.*C* for some role name *r* ∈ *Σ*<sub>R</sub> and an EL concept description *C*. An EL *concept description* is a *conjunction* ⨅C where C is a finite set of EL atoms. An EL *concept inclusion* is of the form *C* ⊑ *D* for EL concept descriptions *C* and *D*, and an EL *TBox* is a finite set of such concept inclusions. An EL *concept assertion* is an expression *C*(*u*), where *C* is an EL concept description and *u* ∈ *Σ*<sub>O</sub>.

For each interpretation I of *Σ*, we extend the interpretation function · <sup>I</sup> to EL atoms and EL concept descriptions in the following manner:

$$\begin{array}{l} - \ (\exists r. C)^{\mathcal{I}} := \{\delta \mid \text{there exists some } \gamma \text{ such that } (\delta, \gamma) \in r^{\mathcal{I}} \text{ and } \gamma \in C^{\mathcal{I}}\}, \\ - \ (\prod \mathcal{C})^{\mathcal{I}} := \bigcap \{\, C^{\mathcal{I}} \mid C \in \mathcal{C} \,\} \text{ where } \bigcap \emptyset = \Delta^{\mathcal{I}}. \end{array}$$

The interpretation I is a *model* of the concept inclusion *C* ⊑ *D* (the concept assertion *C*(*u*)) if *C*<sup>I</sup> ⊆ *D*<sup>I</sup> (*u*<sup>I</sup> ∈ *C*<sup>I</sup>), and of the TBox T if it is a model of each concept inclusion in T.

To make the syntax introduced above more akin to the one usually employed for EL, we denote the empty conjunction ⨅∅ as ⊤ (*top concept*), singleton conjunctions ⨅{*C*} as *C*, and conjunctions ⨅C for |C| ≥ 2 as *C*<sub>1</sub> ⊓ ... ⊓ *C<sub>n</sub>*, where *C*<sub>1</sub>, ..., *C<sub>n</sub>* is an enumeration of the elements of C in an arbitrary order. Since we do not distinguish between the singleton conjunction ⨅{*C*} and the atom *C*, each atom is also a concept description. The set Sub(*C*) of *subconcepts* of an EL concept description *C* is defined as follows: Sub(*A*) := {*A*}, Sub(∃*r*.*C*) := {∃*r*.*C*} ∪ Sub(*C*), and Sub(⨅C) := {⨅C} ∪ ⋃{ Sub(*D*) | *D* ∈ C }. The set Atoms(*C*) consists of all atoms contained in Sub(*C*). These two notions are extended to TBoxes and sets of concept assertions in the obvious way.
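The recursive definitions of Sub and Atoms translate directly into code. In the following sketch we fix an ad-hoc encoding of EL concept descriptions (our assumption, not from the paper): a concept name is a string, an existential restriction ∃r.D is the tuple `('exists', r, D)`, and a conjunction ⨅C is `('and', frozenset(C))`.

```python
def sub(C):
    """Sub(C): the set of subconcepts of an EL concept description."""
    if isinstance(C, str):              # concept name A
        return {C}
    if C[0] == 'exists':                # ∃r.D
        return {C} | sub(C[2])
    result = {C}                        # conjunction ('and', frozenset(...))
    for D in C[1]:
        result |= sub(D)
    return result

def atoms(C):
    """Atoms(C): the atoms (names and existential restrictions) in Sub(C)."""
    return {D for D in sub(C) if isinstance(D, str) or D[0] == 'exists'}

# Example: C = ∃r.(A ⊓ ∃s.B)
C = ('exists', 'r', ('and', frozenset({'A', ('exists', 's', 'B')})))
```

For this `C`, `sub(C)` has five elements (C itself, the inner conjunction, A, ∃s.B, and B), and `atoms(C)` drops the conjunction.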

Let *α*, *β* be qABoxes, concept inclusions, or concept assertions (possibly not both of the same kind), and T an EL TBox. Then we write I |= *α* if the interpretation I is a model of *α*. We say that *α entails β w.r.t.* T (written *α* |=<sub>T</sub> *β*) if every model of *α* and T is a model of *β*. Furthermore, *α* and *β* are *equivalent w.r.t.* T (written *α* ≡<sub>T</sub> *β*), if *α* |=<sub>T</sub> *β* and *β* |=<sub>T</sub> *α*. In case T = ∅, we will sometimes write |= instead of |=<sub>∅</sub>. If ∃∅.∅ |=<sub>T</sub> *C* ⊑ *D*, then we also write *C* ⊑<sub>T</sub> *D* and say that *C is subsumed by D w.r.t.* T; in case T = ∅ we simply say that *C* is subsumed by *D*. Two EL concept descriptions are *equivalent w.r.t.* T (written *C* ≡<sub>T</sub> *D*) if they subsume each other w.r.t. T. We write *C* ⊏<sub>T</sub> *D* to indicate that *C* ⊑<sub>T</sub> *D*, but *C* ≢<sub>T</sub> *D*. If ∃*X*.A |=<sub>T</sub> *C*(*a*), then *a* is called an *instance of C* w.r.t. ∃*X*.A and T. For EL, the subsumption and the instance problem are decidable in polynomial time [2]. However, entailment between qABoxes is NP-complete even w.r.t. the empty TBox [7].

We also use the reduced form *C*<sup>r</sup> of EL concept descriptions *C* [23], which is obtained by removing redundant subdescriptions (see [7] for details). Adapting the results in [23], one can show that *C* ≡<sub>∅</sub> *C*<sup>r</sup> and that *C* ≡<sub>∅</sub> *D* implies *C*<sup>r</sup> = *D*<sup>r</sup>.
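Subsumption w.r.t. the empty TBox can be decided by the classic structural criterion (every top-level atom of D must subsume some top-level atom of C), and a conjunct is redundant whenever another conjunct is subsumed by it. The sketch below illustrates this under the same ad-hoc encoding as an assumption (names are strings, ∃r.D is `('exists', r, D)`, ⨅C is `('and', frozenset(C))`); it is a simplification of [23] and does not normalize syntactically distinct but equivalent nested conjuncts.

```python
def top_atoms(C):
    """Top-level atoms of C (an atom is its own singleton conjunction)."""
    return C[1] if isinstance(C, tuple) and C[0] == 'and' else {C}

def subsumed(C, D):
    """C ⊑_∅ D: every top-level atom of D subsumes some top-level atom of C."""
    return all(any(atom_subsumed(c, d) for c in top_atoms(C))
               for d in top_atoms(D))

def atom_subsumed(c, d):
    if isinstance(c, str) or isinstance(d, str):
        return c == d                                  # concept names
    return c[1] == d[1] and subsumed(c[2], d[2])       # ∃r.C' vs ∃r.D'

def reduce_concept(C):
    """A reduced form: recursively drop every conjunct that is weaker
    than another conjunct, keeping the first of each equivalence class."""
    if isinstance(C, str):
        return C
    if C[0] == 'exists':
        return ('exists', C[1], reduce_concept(C[2]))
    conjuncts = [reduce_concept(D) for D in C[1]]
    kept = []
    for i, c in enumerate(conjuncts):
        redundant = any(subsumed(d, c) and not (subsumed(c, d) and j > i)
                        for j, d in enumerate(conjuncts) if j != i)
        if not redundant:
            kept.append(c)
    if len(kept) == 1:
        return kept[0]
    return ('and', frozenset(kept))

# Example: A ⊓ ∃r.A ⊓ ∃r.(A ⊓ B); the conjunct ∃r.A is redundant,
# because ∃r.(A ⊓ B) is subsumed by it.
C = ('and', frozenset({'A', ('exists', 'r', 'A'),
                       ('exists', 'r', ('and', frozenset({'A', 'B'})))}))
```

Reducing this `C` yields A ⊓ ∃r.(A ⊓ B), which is equivalent to `C` w.r.t. the empty TBox.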

#### **3 A Tale of Two Entailments**

DL-based ontologies are usually accessed through appropriate query languages, where for the purpose of this paper it is sufficient to assume that a query language is given by a fragment of first-order logic. Instead of comparing ontologies w.r.t. the models they have, it thus makes sense to compare them w.r.t. the answers to queries they entail [25]. Given such a query language QL and an EL TBox T, we say that the qABox ∃*X*.A QL*-entails* the qABox ∃*Y*.B *w.r.t.* T (written ∃*X*.A |=<sup>T</sup><sub>QL</sub> ∃*Y*.B) if for each query *ϕ*(*x*<sub>1</sub>, ..., *x<sub>k</sub>*) ∈ QL and each tuple of individuals (*a*<sub>1</sub>, ..., *a<sub>k</sub>*) we have that T ∧ ∃*Y*.B |= *ϕ*(*a*<sub>1</sub>, ..., *a<sub>k</sub>*) implies T ∧ ∃*X*.A |= *ϕ*(*a*<sub>1</sub>, ..., *a<sub>k</sub>*), where we view the TBox and the ABox as first-order formulae and |= is classical first-order entailment (see [25] for more details). We say that two qABoxes are QL*-equivalent w.r.t.* T if they QL-entail each other w.r.t. T, and denote this equivalence relation as ≡<sup>T</sup><sub>QL</sub>.

For EL ontologies, one usually considers instance queries (IQ) or conjunctive queries (CQ). The former are given by EL concept descriptions, viewed as first-order formulae with one free variable. The latter are basically qABoxes of the form ∃*X*.A, but with the elements of *Σ*<sub>I</sub>(∃*X*.A) viewed as free variables. Replacing these free variables with a tuple of individuals thus yields a qABox in the sense introduced above. In particular, this means that CQ-entailment corresponds to entailment of the same qABoxes (see [7] for more details regarding the connection between conjunctive queries and qABoxes).

#### **3.1 Classical Entailment and** CQ**-Entailment**

Due to the close connection between conjunctive queries and qABoxes mentioned above, it is easy to see that the classical entailment relation |=<sub>T</sub> between qABoxes, as introduced in the previous section, actually coincides with CQ-entailment |=<sup>T</sup><sub>CQ</sub>. To keep the notation more uniform and to distinguish this kind of entailment explicitly from IQ-entailment, we will usually talk about CQ-entailment and write |=<sup>T</sup><sub>CQ</sub>.

Whenever we compare two qABoxes ∃*X*.A and ∃*Y*.B, we assume without loss of generality that they are *renamed apart*, which means that *X* is disjoint with *Σ*<sub>O</sub>(∃*Y*.B) and *Y* is disjoint with *Σ*<sub>O</sub>(∃*X*.A), and we further assume that the two qABoxes speak about the same set of individual names *Σ*<sub>I</sub> := *Σ*<sub>I</sub>(∃*X*.A) ∪ *Σ*<sub>I</sub>(∃*Y*.B). For the case of an empty TBox, it was shown in [7] that ∃*X*.A |=<sup>∅</sup><sub>CQ</sub> ∃*Y*.B iff there is a homomorphism from ∃*Y*.B to ∃*X*.A. A *homomorphism* from


The ⊓-rule has highest priority and the ⊑-rule has lowest priority.

Fig. 1: The CQ-saturation rules.

∃*Y*.B to ∃*X*.A is a mapping *h*: *Σ*<sub>O</sub>(∃*Y*.B) → *Σ*<sub>O</sub>(∃*X*.A) such that *h*(*a*) = *a* for each *a* ∈ *Σ*<sub>I</sub>, *A*(*h*(*u*)) ∈ A for each *A*(*u*) ∈ B, and *r*(*h*(*u*), *h*(*v*)) ∈ A for each *r*(*u*, *v*) ∈ B. In order to obtain a similar characterization of entailment for the case of a non-empty TBox T, we need to saturate the given qABox w.r.t. T.
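For small inputs, the homomorphism criterion can be checked by brute force over all candidate mappings (entailment is NP-complete in general, so this is fine only for toys). A sketch, with assertions encoded as tuples of our choosing:

```python
from itertools import product

def homomorphism_exists(B_vars, B_concepts, B_roles,
                        A_vars, A_concepts, A_roles):
    """Is there a homomorphism from ∃Y.B to ∃X.A?  Brute force over all
    mappings of B's objects into A's objects; individuals must be mapped
    to themselves.  Assertions are tuples (A, u) and (r, u, v)."""
    def objects(vars_, concepts, roles):
        return vars_ | {u for _, u in concepts} | \
               {x for _, u, v in roles for x in (u, v)}
    B_objs = sorted(objects(B_vars, B_concepts, B_roles))
    A_objs = objects(A_vars, A_concepts, A_roles)
    for image in product(A_objs, repeat=len(B_objs)):
        h = dict(zip(B_objs, image))
        if any(h[u] != u for u in B_objs if u not in B_vars):
            continue                                  # h(a) = a required
        if all((C, h[u]) in A_concepts for C, u in B_concepts) and \
           all((r, h[u], h[v]) in A_roles for r, u, v in B_roles):
            return True
    return False

# ∃∅.{Rich(JERRY), Famous(JERRY), parent(BEN, JERRY)} entails
# ∃{x}.{Rich(x), parent(BEN, x)} via the homomorphism x ↦ JERRY:
A_concepts = {("Rich", "JERRY"), ("Famous", "JERRY")}
A_roles = {("parent", "BEN", "JERRY")}
```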

Basically, this saturation performs what is called *the chase* in the database community [26,20,10]. Given an EL TBox T and a qABox ∃*X*.A, it extends the ABox by new assertions that are implied by the TBox. The rules that realize this are described in Fig. 1. Their rôle is two-fold: whereas the ⊑-rule adds new concept assertions that are implied by the ABox together with the TBox, the other two rules break down the complex concept assertions added by this rule into smaller parts.

In general, applying these rules need not terminate; e.g., if applied to the qABox ∃∅.{*A*(*a*)} for the TBox {*A* ⊑ ∃*r*.*A*}. There are various sufficient conditions that guarantee termination of the chase [13]. Here, we use a condition introduced in [1] in the context of unification in EL.

**Definition 1.** *The* EL *TBox* T *is* cycle-restricted *if there is no non-empty sequence of role names r*<sub>1</sub>, ..., *r<sub>k</sub>* *and* EL *concept description C such that C* ⊑<sub>T</sub> ∃*r*<sub>1</sub>.···∃*r<sub>k</sub>*.*C.*

As shown in [1], it can be decided in polynomial time whether a given EL TBox is cycle-restricted or not. For cycle-restricted TBoxes, CQ-saturation always terminates.

**Theorem 2.** *Let* T *be a cycle-restricted* EL *TBox and* ∃*X*.A *a qABox. Then exhaustive application of the* CQ*-saturation rules terminates in exponential time in the size of* ∃*X*.A *and* T*, and yields a qABox* sat<sup>T</sup><sub>CQ</sub>(∃*X*.A) *such that the following statements are equivalent for all qABoxes* ∃*Y*.B*:*


We can show that there are examples where the CQ-saturation of a qABox w.r.t. a cycle-restricted TBox is of exponential size, and thus its computation must take exponential time. Nevertheless, the entailment relation |=<sup>T</sup><sub>CQ</sub> can still be decided within NP by adapting results for conjunctive query answering in EL [30].


The ⊓-rule has higher precedence than the ∃-rule, and the latter has higher precedence than the ⊑-rule.

Fig. 2: The IQ-saturation rules.

#### **3.2** IQ**-Entailment**

Recall that the qABox ∃*X*.A IQ-entails the qABox ∃*Y*.B w.r.t. the EL TBox T if every concept assertion *C*(*a*) entailed w.r.t. T by the latter is also entailed w.r.t. T by the former. In the following we assume again that these two qABoxes are renamed apart. For the case of an empty TBox, it was shown in [7] that ∃*X*.A |=<sup>∅</sup><sub>IQ</sub> ∃*Y*.B iff there is a simulation from ∃*Y*.B to ∃*X*.A. A *simulation* from ∃*Y*.B to ∃*X*.A is a relation S ⊆ *Σ*<sub>O</sub>(∃*Y*.B) × *Σ*<sub>O</sub>(∃*X*.A) such that (*a*, *a*) ∈ S for each *a* ∈ *Σ*<sub>I</sub> and, for each (*u*, *v*) ∈ S, *A*(*u*) ∈ B implies *A*(*v*) ∈ A and *r*(*u*, *u′*) ∈ B implies that there exists an object *v′* ∈ *Σ*<sub>I</sub> ∪ *X* such that (*u′*, *v′*) ∈ S and *r*(*v*, *v′*) ∈ A. Since checking the existence of a simulation can be done in polynomial time [16], we conclude that IQ-entailment between qABoxes can be decided in polynomial time for the case of an empty TBox.
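In contrast to the homomorphism check, the greatest simulation can be computed in polynomial time by starting from all pairs that satisfy the concept condition and repeatedly deleting pairs that violate the role condition. A sketch under a tuple encoding of our choosing:

```python
def greatest_simulation(B_vars, B_concepts, B_roles,
                        A_vars, A_concepts, A_roles, individuals):
    """Greatest simulation from ∃Y.B to ∃X.A, computed by fixpoint
    refinement.  Returns None if the result misses (a, a) for some
    shared individual a, i.e. if no simulation exists.  Assertions are
    tuples (A, u) and (r, u, v)."""
    def objects(vars_, concepts, roles):
        return vars_ | {u for _, u in concepts} | \
               {x for _, u, v in roles for x in (u, v)}
    B_objs = objects(B_vars, B_concepts, B_roles)
    A_objs = objects(A_vars, A_concepts, A_roles)
    # start from all pairs satisfying the concept condition
    S = {(u, v) for u in B_objs for v in A_objs
         if all((C, v) in A_concepts for C, uu in B_concepts if uu == u)}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(S):
            # every r-successor u' of u needs an r-successor v' of v
            # with (u', v') still in S
            ok = all(any((r, v, v2) in A_roles and (u2, v2) in S
                         for v2 in A_objs)
                     for r, uu, u2 in B_roles if uu == u)
            if not ok:
                S.discard((u, v))
                changed = True
    return S if all((a, a) in S for a in individuals) else None

# ∃{x}.{Rich(x), parent(BEN, x)} is IQ-entailed by
# ∃∅.{Rich(JERRY), Famous(JERRY), parent(BEN, JERRY)}:
S = greatest_simulation({"x"}, {("Rich", "x")}, {("parent", "BEN", "x")},
                        set(), {("Rich", "JERRY"), ("Famous", "JERRY")},
                        {("parent", "BEN", "JERRY")}, {"BEN"})
```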

To extend these results to the case of a non-empty TBox, we again need to saturate the ABox w.r.t. the TBox. But now the saturation rules, given in Fig. 2, are more parsimonious w.r.t. the introduction of new objects. To be more precise, for each existential restriction ∃*r*.*C* ∈ Sub(T), we assume that *x<sub>C</sub>* is a fresh variable not contained in the initial qABox ∃*X*.A. When applying the ∃-rule to an assertion of the form (∃*r*.*C*)(*t*), we always use this variable for the successor object. Due to this restriction, IQ-saturation always terminates, i.e., it is not necessary to impose any restrictions on the TBox. Also note that IQ-saturation basically generates a qABox representation of what is called the *canonical model* in [25, Section 5.2].

**Theorem 3.** *Let* T *be an* EL *TBox and* ∃*X*.A *a qABox. Then exhaustive application of the* IQ*-saturation rules terminates in polynomial time in the size of* ∃*X*.A *and* T*, and yields a qABox* sat<sup>T</sup><sub>IQ</sub>(∃*X*.A) *such that the following statements are equivalent for all qABoxes* ∃*Y*.B*:*


Since sat<sup>T</sup><sub>IQ</sub>(∃*X*.A) can be computed in polynomial time and the existence of a simulation can be decided in polynomial time, this shows that the entailment relation |=<sup>T</sup><sub>IQ</sub> can be decided in polynomial time.
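The effect of sharing the fresh variable x_C can be seen in a small sketch of this idea (ours; it only handles TBoxes in a simple normal form of our choosing, whereas the rules of Fig. 2 cover arbitrary EL TBoxes). On the TBox {A ⊑ ∃r.A}, for which the chase does not terminate, saturation stops after looping back to the single shared variable x_A:

```python
def iq_saturate(variables, concepts, roles, tbox):
    """IQ-saturation sketch for TBoxes in a simple normal form.  Axioms:
       ('sub', A, B)            A ⊑ B
       ('conj', A1, A2, B)      A1 ⊓ A2 ⊑ B
       ('exists_rhs', A, r, B)  A ⊑ ∃r.B
       ('exists_lhs', r, A, B)  ∃r.A ⊑ B
    All applications of A ⊑ ∃r.B reuse the same fresh variable x_B,
    which is what guarantees termination."""
    variables, concepts, roles = set(variables), set(concepts), set(roles)
    changed = True
    while changed:
        changed = False
        new_c, new_r, new_v = set(), set(), set()
        for ax in tbox:
            if ax[0] == 'sub':
                _, A, B = ax
                new_c |= {(B, u) for (C, u) in concepts if C == A}
            elif ax[0] == 'conj':
                _, A1, A2, B = ax
                new_c |= {(B, u) for (C, u) in concepts
                          if C == A1 and (A2, u) in concepts}
            elif ax[0] == 'exists_rhs':
                _, A, r, B = ax
                x = ('x', B)                    # the shared fresh variable
                for (C, u) in concepts:
                    if C == A:
                        new_r.add((r, u, x))
                        new_c.add((B, x))
                        new_v.add(x)
            else:                               # 'exists_lhs'
                _, r, A, B = ax
                new_c |= {(B, u) for (rr, u, v) in roles
                          if rr == r and (A, v) in concepts}
        if new_c - concepts or new_r - roles:
            concepts |= new_c
            roles |= new_r
            variables |= new_v
            changed = True
    return variables, concepts, roles

# {A ⊑ ∃r.A} applied to ∃∅.{A(a)}: the chase would build an infinite
# r-chain, while this saturation reuses x_A and terminates.
variables, concepts, roles = iq_saturate(
    set(), {("A", "a")}, set(), [("exists_rhs", "A", "r", "A")])
```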

## **4 Canonical Repairs**

We specify what is to be repaired by a finite set of EL concept assertions, which we call a repair request. A repair is a qABox that does not have any of these assertions as a consequence. This generalizes previous repair approaches [6] in that more than one consequence specified as unwanted is removed in one step. It also encompasses the notion of a privacy policy, as introduced in [7], which specifies forbidden concepts, with the meaning that one should not be able to derive that any of the individuals occurring in the qABox is an instance of such a concept. We assume that the TBox is static (i.e., may not be changed by the repair) and consider both CQ- and IQ-entailment for comparing qABoxes.

**Definition 4.** *Let* T *be an* EL *TBox and* QL ∈ {CQ*,* IQ}*.*


Intuitively, a repair is a qABox that has no new consequences of the specified type (instance relationships or answers to conjunctive queries), and no longer has the consequences forbidden by the repair request. In an optimal repair, a minimal amount of consequences of the specified type is lost. Since there are different options for what to change when repairing a qABox, there may exist several non-equivalent optimal repairs.

In the following, let QL ∈ {CQ, IQ} and let T be a fixed TBox, which is assumed to be cycle-restricted if QL = CQ. In addition, let R be a repair request and ∃*X*.A the qABox to be QL-repaired for R w.r.t. T. We assume that R does not contain an assertion of the form *C*(*a*) such that ⊤ ⊑<sub>T</sub> *C*, since the presence of such an assertion would preclude the existence of a repair. If R satisfies this restriction, then the empty qABox ∃∅.∅ is always a repair. However, as mentioned in the introduction, this does not imply that there is an optimal repair. We will show that, for the case of IQ-entailment, optimal repairs always exist. For CQ-entailment, this is the case if the TBox T is cycle-restricted. In both cases, the set of optimal repairs covers all repairs in the sense that each repair is entailed by some optimal repair.

As mentioned in the introduction, to deal with TBoxes, the approach for computing so-called canonical repairs from [7] needs to be adapted in two ways. First, one needs to QL-saturate the given qABox w.r.t. the TBox. Second, when computing canonical repairs from sat<sup>T</sup><sub>QL</sub>(∃*X*.A), the construction needs to ensure that the TBox does not reintroduce consequences that have been removed by the repair. The main idea underlying the construction of canonical repairs is to introduce variables as copies of the objects occurring in sat<sup>T</sup><sub>QL</sub>(∃*X*.A). Such a variable is of the form *y*<sub>*u*,K</sub>, where the first component of the subscript says that this is a copy of the object *u*. The second component K is a set of atoms, with the intuitive meaning that *y*<sub>*u*,K</sub> must *not* be an instance of any element of K. To avoid introducing unnecessary copies, certain restrictions were imposed in [7] on the sets K. We add a further restriction that takes care of the TBox.

To be more precise, let Sub(R, T) be the set of subconcepts of concept descriptions occurring in R or T, and let Atoms(R, T) be the set of atoms occurring in Sub(R, T). The set K in a variable *y*<sub>*u*,K</sub> must be a repair type for *u*.

**Definition 5.** *Let* ∃*Y*.B := sat<sup>T</sup><sub>QL</sub>(∃*X*.A) *and let u be an object name occurring in* B*. A* repair type *for u is a subset* K *of* Atoms(R, T) *that satisfies the following:*


The first two conditions coincide with the ones in [7]. Basically, 1. says that we only need to remove instance relationships explicitly if they are really there. Condition 2. corresponds to the fact that preventing *D*(*y*<sub>*u*,K</sub>) as a consequence also prevents *C*(*y*<sub>*u*,K</sub>) if *D* subsumes *C*, and thus *C* ∈ K would be redundant if *D* ∈ K. Condition 3. ensures that instance relationships that are removed due to K cannot be re-introduced by the TBox. It is easy to see that the set of repair types for *u* can be computed in exponential time.

Similarly to the approach in [7], canonical repairs are induced by seed functions. Such a function determines, for each individual, which instance relationships should be prevented in order to obtain a repair.

**Definition 6.** *A* repair seed function *is a function s that maps each individual name b* ∈ Σ<sub>I</sub>(∃*X*.𝒜) *to a repair type s*(*b*) *for b that satisfies the following:*

**–** *if C*(*b*) ∈ ℛ *and* sat<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜) |= *C*(*b*)*, then s*(*b*) *contains an atom D such that C* ⊑<sup>∅</sup> *D.*

Using our general assumption that the repair request ℛ does not contain a concept assertion *C*(*a*) with ⊤ ⊑<sup>𝒯</sup> *C*, we can show that there is always at least one repair seed function. Each repair seed function induces a repair as follows.

**Definition 7.** *Given a repair seed function s, we define the* canonical QL-repair rep<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜*, s*) induced by *s as the qABox* ∃*Y*.ℬ *where*

- **–** *A*(*y*<sub>*u*,𝒦</sub>) ∈ ℬ *for each concept assertion A*(*u*) *in* sat<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜) *such that A* ∉ 𝒦*,*
- **–** *r*(*y*<sub>*u*,𝒦</sub>*, y*<sub>*v*,ℒ</sub>) ∈ ℬ *for each role assertion r*(*u, v*) *in* sat<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜) *such that the following holds for each* ∃*r.C* ∈ 𝒦*: if the matrix of* sat<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜) *entails C*(*v*)*, then the set* ℒ *contains an atom that subsumes C.*

Our construction of canonical repairs based on seed functions is sound and complete in the following sense.

**Proposition 8.** *For each repair seed function s, the induced canonical repair* rep<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜*, s*) *is a* QL*-repair of* ∃*X*.𝒜 *for* ℛ *w.r.t.* 𝒯*. Conversely, if* ∃*Y*.ℬ *is a* QL*-repair of* ∃*X*.𝒜 *for* ℛ *w.r.t.* 𝒯*, then there is a repair seed function s such that* rep<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜*, s*) |=<sup>𝒯</sup><sub>QL</sub> ∃*Y*.ℬ*.*

We define the set of all canonical QL-repairs of ∃*X*.𝒜 for ℛ w.r.t. 𝒯 as

Repairs<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜, ℛ) := { rep<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜, *s*) | *s* is a repair seed function }.

As an easy consequence of Proposition 8, we obtain that Repairs<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜, ℛ) contains all optimal repairs (up to equivalence). However, as in the case without a TBox, it may also contain non-optimal repairs [7]. To compute the set of optimal repairs, one thus needs to remove such non-optimal elements from Repairs<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜, ℛ). Since the entailment test required for this is NP-complete for QL = CQ and polynomial for QL = IQ, we obtain the following theorem.
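The filtering step just described can be sketched in a few lines; `entails(P, Q)` stands in for the QL-entailment oracle (NP-complete for CQ, polynomial for IQ), and the repair encoding below is a toy of our own choosing.

```python
def optimal_repairs(repairs, entails):
    """Keep only the repairs that are not strictly entailed by another one.

    entails(P, Q) -- hypothetical oracle deciding whether repair P entails
    repair Q; a repair is non-optimal iff some other repair strictly
    entails it, so such elements are removed.
    """
    return [p for p in repairs
            if not any(entails(q, p) and not entails(p, q) for q in repairs)]
```

As a toy model of entailment, one can let a repair (a set of assertions) entail exactly the repairs whose assertions it contains; then only the set-maximal repairs survive.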

**Theorem 9.** *There is a (deterministic) algorithm that computes the set of all optimal* QL*-repairs of* ∃*X*.𝒜 *for* ℛ *w.r.t.* 𝒯 *and runs in exponential time. If* QL = CQ*, then this algorithm needs access to an NP oracle, whereas no such oracle is required for* QL = IQ*.*

## **5 Optimized Repairs**

The construction of the canonical repair induced by a seed function described in the previous section usually introduces an exponential number of copies for the objects occurring in the saturated qABox. The following example demonstrates that this is not always necessary to obtain an optimal repair.

*Example 10.* Let 𝒯 := ∅ and consider the repair request {(∃*r*.(*A*<sub>1</sub> ⊓ … ⊓ *A*<sub>*n*</sub>))(*a*)} for the qABox ∃{*x*}.{*r*(*a, x*), *A*<sub>1</sub>(*x*), …, *A*<sub>*n*</sub>(*x*)}. There is only one repair seed function *s*, which assigns {∃*r*.(*A*<sub>1</sub> ⊓ … ⊓ *A*<sub>*n*</sub>)} to *a*. Both for the CQ and the IQ case, the canonical repair induced by *s* contains 2<sup>*n*</sup> copies of *x*, namely all the variables *y*<sub>*x*,𝒦</sub> for 𝒦 ⊆ {*A*<sub>1</sub>, …, *A*<sub>*n*</sub>}. However, most of these copies are redundant. In fact, we will see below that there are optimal repairs equivalent to the canonical one that contain only linearly many variables in *n*, both for the CQ and the IQ case.

The idea is now to construct, for a given seed function, a set of variables that is a (hopefully small) subset of the set *Y* introduced in Definition 7, but which is nevertheless sufficient to obtain a repair equivalent to the canonical one. Note, however, that in general an exponential blow-up cannot be avoided, as already shown in [5] for the case of EL instance stores. Throughout this section, we assume that QL, 𝒯, ℛ, and ∃*X*.𝒜 satisfy the properties assumed in the previous section. In addition, we assume that the repair request ℛ is *reduced*, i.e., every concept occurring in a concept assertion in ℛ is reduced, and if ℛ contains *C*(*a*) and *D*(*a*) for distinct concept descriptions *C, D*, then *C* ⋢<sup>∅</sup> *D*; we further assume that each concept occurring in the TBox 𝒯 is reduced. Before we can describe our construction of the set of relevant variables, we must introduce some notation and show an auxiliary result.

Given two sets of concept descriptions 𝒦 and ℒ, we say that ℒ *covers* 𝒦 (written 𝒦 ≤ ℒ) if each concept in 𝒦 is subsumed by some concept in ℒ.

Now, let *s* be a repair seed function and set ∃*Y*.ℬ := rep<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜*, s*). Recall that, according to Definition 7, a role assertion *r*(*y*<sub>*t*,𝒦</sub>, *y*<sub>*u*,ℒ</sub>) belongs to the matrix ℬ iff the saturation sat<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜) contains the role assertion *r*(*t, u*) and the repair type ℒ covers the set Succ(𝒦, *r*, *u*) := { *C* | ∃*r.C* ∈ 𝒦 and the matrix of sat<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜) entails *C*(*u*) }.

If ℒ does not satisfy this requirement, there might be another repair type ℒ′ such that the canonical repair contains the assertion *r*(*y*<sub>*t*,𝒦</sub>, *y*<sub>*u*,ℒ′</sub>), and thus our optimized repair needs to contain an appropriate variable to which *y*<sub>*u*,ℒ′</sub> can be mapped by a homomorphism or simulation. We generate such variables by looking for repair types ℳ that cover both ℒ and Succ(𝒦, *r*, *u*). The set of all such repair types can effectively be computed, though it might be empty. For our purposes, it is sufficient to use only the ones that are minimal w.r.t. the cover relation ≤.

**Lemma 11.** *The set of all* ≤*-minimal repair types for u that cover* ℒ ∪ Succ(𝒦, *r*, *u*) *can be computed in exponential time.*

In general, this computation may produce exponentially many repair types, but this is not always the case. For instance, consider *a* = *y*<sub>*a*,*s*(*a*)</sub> and *y*<sub>*x*,∅</sub> in Example 10. We have Succ(*s*(*a*), *r*, *x*) = {*A*<sub>1</sub> ⊓ … ⊓ *A*<sub>*n*</sub>}, and thus the assertion *r*(*a*, *y*<sub>*x*,∅</sub>) is not in ℬ, since ∅ clearly does not cover Succ(*s*(*a*), *r*, *x*). The ≤-minimal repair types covering Succ(*s*(*a*), *r*, *x*) are exactly the sets {*A*<sub>*i*</sub>} for *i* = 1, …, *n*.
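The cover relation and the selection of ≤-minimal covering repair types can be sketched as follows; `subsumed_by` is a hypothetical subsumption oracle, and the candidate repair types are assumed to be given (e.g., enumerated beforehand).

```python
def covers(L, K, subsumed_by):
    """K <= L: every concept in K is subsumed by some concept in L.
    subsumed_by(C, D) is a hypothetical test for C being subsumed by D."""
    return all(any(subsumed_by(C, D) for D in L) for C in K)

def minimal_covering_types(candidates, target, subsumed_by):
    """Among the candidate repair types, keep those that cover `target`
    and are minimal w.r.t. the cover relation <=."""
    covering = [K for K in candidates if covers(K, target, subsumed_by)]
    return [K for K in covering
            # K is minimal iff no covering L is strictly below K,
            # i.e. L <= K holds but K <= L does not.
            if not any(covers(K, L, subsumed_by) and
                       not covers(L, K, subsumed_by)
                       for L in covering)]
```

Replaying Example 10 with n = 2 (atoms A1, A2 and the conjunction subsumed by each of them), the minimal covering types come out as exactly the singletons {A1} and {A2}.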

In the following, we construct a sequence *Y*<sub>0</sub>, *Y*<sub>1</sub>, …, *Y*<sub>*m*</sub> of subsets *Y*<sub>*i*</sub> of *Y* such that ∃*Y*.ℬ is QL-equivalent to its sub-qABox ∃*Y*<sub>*m*</sub>.ℬ<sub>*m*</sub>, where ℬ<sub>*m*</sub> contains only those assertions in ℬ involving object names in Σ<sub>I</sub> ∪ *Y*<sub>*m*</sub>. Recall that we use *y*<sub>*a*,*s*(*a*)</sub> as synonyms for the individuals *a* ∈ Σ<sub>I</sub>.

We start with the set *Y*<sub>0</sub>, which is empty if QL = IQ, and equal to the set { *y*<sub>*t*,∅</sub> | *t* is an object name occurring in sat<sup>𝒯</sup><sub>CQ</sub>(∃*X*.𝒜) } if QL = CQ.

The subsequent sets are obtained by exhaustively applying one of the following rules, depending on whether QL = CQ or QL = IQ.


The sets *Y*<sub>*i*</sub> are all subsets of the set *Y* of variables in the canonical repair. Since each rule application adds a variable, the exhaustive application of the rules must terminate after finitely many steps with a set of variables *Y*<sub>*m*</sub> ⊆ *Y*.

Let us illustrate this construction using Example 10, first for the IQ case. We have *a* = *y*<sub>*a*,*s*(*a*)</sub> ∈ Σ<sub>I</sub>, and the assertion *r*(*a*, *x*) belongs to the saturation, which is equal to the original qABox. As mentioned above, the ≤-minimal repair types covering Succ(*s*(*a*), *r*, *x*) are exactly the sets {*A*<sub>*i*</sub>} for *i* = 1, …, *n*. Thus, repeated applications of the IQ-construction rule add the variables *y*<sub>*x*,{*A*<sub>*i*</sub>}</sub>, and the construction ends with *Y*<sup>IQ</sup><sub>*m*</sub> = { *y*<sub>*x*,{*A*<sub>*i*</sub>}</sub> | *i* = 1, …, *n* }. In the CQ case, the initial set of variables is *Y*<sup>CQ</sup><sub>0</sub> = {*y*<sub>*a*,∅</sub>, *y*<sub>*x*,∅</sub>}. In this example, the CQ-construction rule then generates the same variables as the IQ rule, though this need not be the case in general. We end up with the final set *Y*<sup>IQ</sup><sub>*m*</sub> ∪ *Y*<sup>CQ</sup><sub>0</sub>.

**Definition 12.** *Let s be a repair seed function and Y*<sub>*m*</sub> ⊆ *Y be the set of variables obtained by an exhaustive application of the* QL*-construction rule. The* optimized QL-repair *of* ∃*X*.𝒜 *for* ℛ *w.r.t.* 𝒯 *induced by s, denoted by* orep<sup>𝒯</sup><sub>QL</sub>(∃*X*.𝒜*, s*)*, is the qABox* ∃*Y*<sub>*m*</sub>.ℬ<sub>*m*</sub>*, where the matrix* ℬ<sub>*m*</sub> *contains all assertions in* ℬ *involving only object names in* Σ<sub>I</sub> ∪ *Y*<sub>*m*</sub>*.*

Note that, to compute ℬ<sub>*m*</sub>, we need not compute the larger matrix ℬ first. Instead, we just apply the definition of the matrix in Definition 7 to the object names in Σ<sub>I</sub> ∪ *Y*<sub>*m*</sub>.

In our example, the optimized IQ-repair is the qABox ∃*Y*<sup>IQ</sup><sub>*m*</sub>.ℬ<sub>*m*</sub> with

$$\mathcal{B}\_m = \left\{ r(a, y\_{x, \{A\_i\}}) \mid 1 \le i \le n \right\} \cup \left\{ A\_j(y\_{x, \{A\_i\}}) \mid j \ne i \text{ and } 1 \le i, j \le n \right\}.$$

In the optimized CQ-repair, the quantifier prefix additionally contains the variables *ya,*<sup>∅</sup> and *yx,*∅, and the matrix additionally contains the assertions *r*(*ya,*∅*, yx,*∅) and *Ai*(*yx,*∅) for *i* = 1*,...,n*. Note that, without these assertions, the positive answer to the Boolean conjunctive query E *y, z.*(*r*(*y, z*) ∧ *A*1(*z*) ∧ *...* ∧ *An*(*z*)) would be lost.
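The matrix ℬ<sub>*m*</sub> of Example 10 can also be generated mechanically; the following small sketch uses a string encoding of objects and tuple-encoded assertions of our own choosing.

```python
def optimized_iq_matrix(n):
    """Assertions of the optimized IQ-repair from Example 10:
    r(a, y_{x,{A_i}}) for every i, and A_j(y_{x,{A_i}}) for every j != i.
    Objects are encoded as strings, assertions as tuples."""
    role_assertions = {("r", "a", f"y_x_A{i}") for i in range(1, n + 1)}
    concept_assertions = {(f"A{j}", f"y_x_A{i}")
                          for i in range(1, n + 1)
                          for j in range(1, n + 1) if j != i}
    return role_assertions | concept_assertions
```

For n concept names this yields n role assertions and n(n − 1) concept assertions, i.e., quadratically many assertions over linearly many variables, instead of the 2<sup>n</sup> copies of *x* in the canonical repair.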

Coming back to the general case, we first observe that the canonical QL-repair induced by *s* QL-entails the optimized QL-repair induced by *s*, due to the inclusion relationship between these two qABoxes. The entailment in the other direction also holds, but this is harder to show, in particular for QL = CQ.

**Proposition 13.** *For each repair seed function s, the optimized* QL*-repair induced by s* QL*-entails the canonical* QL*-repair induced by s.*

*Proof sketch.* For QL = IQ, the proposition can be proved by showing that the following relation 𝔖 is a simulation from ∃*Y*.ℬ to ∃*Y*<sub>*m*</sub>.ℬ<sub>*m*</sub>:

$$\mathfrak{S} := \{ (y\_{t,\mathcal{K}}, y\_{t,\mathcal{K'}}) \mid y\_{t,\mathcal{K}} \in \Sigma\_{\mathcal{O}}(\exists Y. \mathcal{B}), \ y\_{t,\mathcal{K'}} \in \Sigma\_{\mathcal{O}}(\exists Y\_m. \mathcal{B}\_m), \ \text{and } \mathcal{K'} \le \mathcal{K} \}.$$

For QL = CQ, we introduce a sequence of mappings *h*<sub>0</sub>, *h*<sub>1</sub>, …, *h*<sub>*n*</sub> : Σ<sub>O</sub>(∃*Y*.ℬ) → Σ<sub>O</sub>(∃*Y*<sub>*m*</sub>.ℬ<sub>*m*</sub>), starting with *h*<sub>0</sub>(*y*<sub>*t*,𝒦</sub>) = *y*<sub>*t*,*s*(*t*)</sub> if *t* ∈ Σ<sub>I</sub> and *s*(*t*) ≤ 𝒦, and *h*<sub>0</sub>(*y*<sub>*t*,𝒦</sub>) = *y*<sub>*t*,∅</sub> otherwise. The initial mapping *h*<sub>0</sub> need not be a homomorphism, since role assertions may not be preserved. In the step-wise construction of the mappings *h*<sub>*i*</sub>, such defects are corrected one by one. We can show that this construction always terminates after finitely many steps, yielding a homomorphism *h*<sub>*n*</sub> from ∃*Y*.ℬ to ∃*Y*<sub>*m*</sub>.ℬ<sub>*m*</sub>. ⊓⊔

Summing up, we have thus shown the following theorem, which implies that the optimized repairs also satisfy the properties stated in Proposition 8.

**Theorem 14.** *For each repair seed function s, the canonical* QL*-repair induced by s and the optimized* QL*-repair induced by s are* QL*-equivalent.*

#### **6 Evaluation**

To find out whether the repair approaches introduced in this paper are in principle viable for non-trivial ontologies, we made experiments for both IQ- and CQ-repairs with a first, rather unoptimized implementation. In addition to checking how often the implementation was able to compute a repair within a certain timeout, we also compared the sizes of optimized repairs with those of canonical repairs. We considered two different repair scenarios: repairing a single unwanted consequence for a single individual (S1), and repairing a single unwanted consequence for 10% of the individuals occurring in the ABox (S2). We report here the main results; more details and discussion can be found in [4].

As corpus for our evaluation, we chose the ontologies used in the 2015 OWL Reasoner Competition for the track OWL EL Realisation [28], since they contain a substantial amount of ABox assertions. These 109 ontologies were converted into pure EL by applying standard transformations and afterwards filtering out unsupported axioms. From these ontologies, we kept those that had at most 100,000 axioms in total. The resulting corpus contained 80 ontologies.

We implemented our methods in Java, using the OWL-API<sup>1</sup> for parsing OWL ontologies, and ELK [22] for precomputing any subsumption relationships entailed with and without the TBox potentially relevant for our repair approach. The code is available online.<sup>2</sup> All experiments were performed on an Intel(R) Core(TM) i5-4590 CPU with 4 cores and 32 GB RAM, of which we assigned 16 GB as maximal heap space to the Java VM.

Since it is a precondition of our repair approach, we first saturated the ontologies using the IQ-saturation rules of Figure 2, and the CQ-saturation rules of Figure 1. The CQ-saturation rules were implemented using the rule engine VLog [11] through the Java facade Rulewerk.<sup>3</sup> As CQ-saturation only terminates for cycle-restricted TBoxes, we only considered those ontologies for the CQ-saturation whose IQ-saturation did not introduce cycles between introduced variables. We used a timeout of 60 minutes for every saturation. This way, we successfully computed IQ-saturations of every ontology, and 62 CQ-saturations.

<sup>1</sup> http://owlapi.sourceforge.net

<sup>2</sup> https://github.com/de-tu-dresden-inf-lat/abox-repairs-wrt-static-tbox

<sup>3</sup> https://github.com/knowsys/rulewerk

The size of the saturated ABox was usually not much larger than that of the original one, and always less than two orders of magnitude larger. Interestingly, the successful CQ-saturations were rarely larger than the IQ-saturations, and often even of the same size, because no variables were added.

Scenario S1 was about repairing a single faulty entailment 𝒜 |=<sup>𝒯</sup> *C*(*a*). Since we did not have information about whether any entailments of the considered ontologies are faulty, we generated such assertions randomly. For this, we looked at entailments of the form 𝒜 |=<sup>𝒯</sup> *C*(*a*), where *C* ∈ Sub(𝒯). To make the repair requests more interesting, we furthermore required that *C* is not of the form *A* or ∃*r*.⊤, where *A* is a concept name. This requirement already ruled out 54 of the IQ-saturated ontologies and 44 of the CQ-saturated ontologies, as they did not have any complex entailments of the required form. For Scenario S2, we randomly selected a concept *C* ∈ Sub(𝒯) that had at least one instance (surprisingly, although *C* was not required to be complex, this ruled out 12 ontologies, including 4 of the CQ-saturated ones), together with a random selection of 10% of the individuals in 𝒜, and built the repair request consisting of all assertions *C*(*a*) where *a* ranges over the selected individuals. For both scenarios, we selected a random seed function for the obtained repair request.

For each ontology, scenario, and QL ∈ {IQ*,* CQ}, we attempted to compute optimised QL-repairs for 50 different repair requests. We also tried to compute the set of objects that would be included in the canonical repairs, to get an idea of the impact of our optimisation. For each such repair computation, we used a timeout of 10 minutes. Since all repair requests used only concept descriptions that were already in the input ontology, the number of objects in the canonical repair was independent of the repair request. We thus performed the latter computation only once for each ontology. The success rates were as follows:


This shows that the optimizations introduced in Section 5 have a very positive impact on the viability of our repair approach.

Fig. 3 gives more information on the number of objects and assertions in the computed repairs. On the left, we consider canonical and optimised IQ-repairs for scenario S2: specifically, we look at the difference between the number of individuals occurring in the repair and in the input ABox. In the middle and on the right, we visualise the difference between the number of assertions in the optimized IQ- and CQ-repairs and in the input ABoxes, for the scenarios S1 and S2, respectively. By construction, CQ-repairs cannot contain fewer assertions than the input ontologies. Sometimes the CQ-repairs were nevertheless smaller than the corresponding IQ-repairs, which is due to the different saturation methods: variables introduced by the IQ-saturation could be connected to more individuals than those introduced by the CQ-saturation.

Fig. 3: Evaluation results. On the left, we show the difference of the number of object names in the canonical IQ-repairs (purple triangle) with the same difference, but restricted to objects occurring in assertions, for the optimised IQ-repairs (red circle) for S2. The other two graphs consider optimised IQ- and CQ-repairs for S1 and S2. In each graph, the x-axis shows the number of assertions in the input ontology, and the y-axis the observed difference.

## **7 Conclusion**

This paper presents approaches for repairing DL-based ontologies, in the sense that they allow one to get rid of unwanted consequences. In contrast to most other work on ontology repair, our goal is to compute *optimal* repairs, i.e., ones that lose as few other consequences as possible. As relevant consequences to be preserved, we consider both answers to conjunctive queries (CQ) and answers to EL instance queries (IQ). The presented results improve on our previous work in this direction in two respects. First, we allow for the presence of a TBox, which is assumed to be static (i.e., cannot be changed by the repair), whereas before we assumed the TBox to be empty. Second, we develop a more efficient construction of optimal repairs, which is exponential only in the worst case. Our experimental results show that this optimization makes our repair approach viable also for fairly large ontologies, at least in the IQ case.

One question for future research is how to lift the restriction to cycle-restricted TBoxes in the CQ case. Since optimal repairs need no longer exist then, one can ask whether the existence question is decidable, and how to compute optimal repairs if they exist. We have already noticed in our first attempts to tackle this problem that optimal repairs may then become larger than single-exponential.

In this and in our previous work, we have assumed that unwanted consequences are specified as EL instance relationships. Another interesting open question is whether our results can be generalized to a setting where unwanted consequences are specified as answers to conjunctive queries, as e.g. in [14].<sup>4</sup>

<sup>4</sup> Note that no TBox is considered in [14], and the notion of optimality used there is different from ours (see the introduction of [7] for a discussion of the differences).

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Generalized Completeness for SOS Resolution and its Application to a New Notion of Relevance**

Fajar Haifani <sup>1</sup>,2, Sophie Tourret <sup>1</sup>,3, and Christoph Weidenbach <sup>1</sup>

<sup>1</sup> Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbr¨ucken Germany

<sup>2</sup> Graduate School of Computer Science, Saarbr¨ucken, Germany

<sup>3</sup> Universit´e de Lorraine, CNRS, Inria, LORIA, Nancy, France

**Abstract.** We prove the SOS strategy for first-order resolution to be refutationally complete on a clause set N and set-of-support S if and only if there exists a clause in S that occurs in a resolution refutation from N ∪ S. This strictly generalizes and sharpens the original completeness result requiring N to be satisfiable. The generalized SOS completeness result supports automated reasoning on a new notion of relevance aiming at capturing the support of a clause in the refutation of a clause set. A clause C is *relevant* for refuting a clause set N if C occurs in every refutation of N. The clause C is *semi-relevant* if it occurs in some refutation, i.e., if there exists an SOS refutation with set-of-support S = {C} from N \ {C}. A clause that does not occur in any refutation from N is *irrelevant*, i.e., it is not semi-relevant. Our new notion of relevance separates clauses in a proof that are ultimately needed from clauses that may be replaced by different clauses. In this way it provides insights towards proof explanation in refutations beyond existing notions such as that of an unsatisfiable core.

# **1 Introduction**

Shortly after the invention of first-order resolution [14], its first complete refinement was established: set-of-support (SOS) resolution [18]. The idea of the SOS strategy is to split the current clause set into two sets, namely N and S, and to restrict resolution inferences to have at least one parent from the set-of-support S. Wos et al. [18] proved the SOS strategy complete if N is satisfiable. The motivation of Wos et al. for the SOS strategy was to get rid of "irrelevant" inferences. If N defines a theory and S contains the negation of a conjecture (goal) to be refuted, the strategy puts the emphasis on resolution inferences involving the conjecture. This can be beneficial, because resolution is deductively complete (modulo subsumption) [11, 13], i.e., resolution inferences performed solely on clauses from N will enumerate *all* semantic consequences, not only those that turn out to be useful in refuting N ∪ S. Even in more restrictive contexts, the SOS strategy can be shown complete: e.g., if N is saturated by superposition and does not contain the empty clause, then the SOS strategy is also complete

in the context of the strong superposition inference restrictions on N and a set-of-support S [2].

In this paper, we generalize and sharpen the original completeness result for the SOS strategy: the resolution calculus with the SOS strategy is complete if and only if there is at least one clause in S that is contained in a resolution refutation from N ∪ S (Theorem 11). The proof proceeds via proof transformation: any (non-SOS) refutation from N ∪ S can be transformed into an SOS refutation with SOS S, provided the original refutation contains at least one clause from S.

The generalized SOS completeness result supports our new notion of *relevance*, which is meant to be a first step towards explaining the gist of a refutation. A clause C ∈ N is *relevant* if it is needed for every refutation of the clause set N. The clause C is *semi-relevant* if there is a refutation from N using C, and C is *irrelevant* otherwise (Definition 12). Applying our generalized SOS completeness result, a clause C ∈ N is semi-relevant if and only if there is an SOS refutation from N \ {C} with SOS {C}.
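On ground (propositional) clauses, this characterization yields a direct, if naive, semi-relevance test. The sketch below encodes clauses as frozensets of signed integers (so factoring is implicit in the set representation) and saturates with a plain fixpoint loop; it is a toy illustration of the characterization, not an efficient prover.

```python
def resolvents(c1, c2):
    """All propositional resolvents of two clauses.  Clauses are frozensets
    of nonzero ints; -l is the complement of literal l."""
    return {frozenset((c1 - {l}) | (c2 - {-l})) for l in c1 if -l in c2}

def sos_refutes(n, sos):
    """Saturate under the SOS restriction: every inference takes at least
    one parent from the (growing) set-of-support.  True iff the empty
    clause (bottom) is derived; terminates since the literal set is finite."""
    sos = set(sos)
    while True:
        new = set()
        for c in sos:
            for d in n | sos:
                new |= {r for r in resolvents(c, d)
                        if r not in sos and r not in n}
        if frozenset() in new:
            return True
        if not new:
            return False
        sos |= new

def semi_relevant(n, c):
    # C is semi-relevant for refuting N iff there is an SOS refutation
    # from N \ {C} with set-of-support {C}.
    return sos_refutes(n - {c}, {c})
```

For N = {p, ¬p ∨ q, ¬q, r} the clause p is semi-relevant (it starts the refutation p, q, ⊥), while the isolated clause r is not, since no refutation can use it.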

The interest in semi-relevant clauses comes from real-world applications. In an industrial scenario where different products are built out of a building set, the overall product portfolio is often defined by a set of clauses (rules). Roughly, every clause describes the integration of some part out of the building set into a product. Different proofs for the existence of some product correspond to different builds of the product. For example, answering a question like "Can we build car x with part y?" from the automotive world boils down to the semi-relevance of the clauses defining part y in a refutation showing the existence of a car x. All German car manufacturers maintain such clause sets defining their product portfolio [6, 17].

Our new notion of relevance is related to other notions capturing aspects of a refutation. A minimal unsatisfiable core of an unsatisfiable clause set contains only semi-relevant clauses. The intersection of all minimal unsatisfiable cores is the set of relevant clauses. The notion of a minimal unsatisfiable core does not provide a test for semi-relevance of a specific clause. There are various notions from the description logic community related to unsatisfiable cores of a translation to first-order and/or to our notion of relevance [1,4,8,16]. An in-depth discussion of these relationships can be found in our description logic workshop paper [7]. The notion of relevant clauses is also related to what has been studied in the field of propositional satisfiability under the name of *lean kernels* [9, 10]: Given an unsatisfiable set N of propositional clauses, the lean kernel consists exactly of those clauses that are involved in at least one refutation proof of <sup>N</sup> in the resolution calculus, and thus, in our terminology, the set of semi-relevant clauses. A different notion of relevance was previously defined in the context of propositional abduction [5]. The authors provide algorithms and complexity results for various abduction settings in the propositional logic context. In addition to the fact that our notion of relevance is defined with respect to first-order clauses, in their context of propositional abduction, if a propositional variable is relevant, it must be satisfiability preserving when added to the theory (clause set). In our case, if

a clause C <sup>∈</sup> N is (semi-)relevant, then N is unsatisfiable and N \ {C} may be unsatisfiable as well.

The paper is organized as follows. After fixing some notations and notions at the beginning of Section 2 we introduce our proof transformation technique. First on an example, Figure 1, then in general. The following Section 3 proves important properties of the transformation, yielding our generalized completeness result for SOS, Theorem 11. We then link the SOS completeness result to our notion of semi-relevance in Section 4. The paper ends with a summary, a discussion of the contributions, and directions for future work, Section 5.

# **2 Resolution Proof Transformation**

After fixing some common notions and notation, this section introduces our proof transformation technique. First on an example and afterwards on resolution refutations in general.

We assume a first-order language without equality, where N denotes a clause set; C, D denote clauses; L, K literals; A, B atoms; P, Q, R, T predicates; t, s terms; f, g, h functions; a, b, c constants; and x, y, z variables, all possibly indexed. Atoms, literals, clauses, and clause sets are defined as usual. Clauses are disjunctions of literals. The complement of a literal is denoted by the function comp. Semantic entailment |= considers variables in clauses to be universally quantified. Substitutions σ, τ are total mappings from variables to terms, where dom(σ) := {x | xσ ≠ x} is finite and codom(σ) := {t | xσ = t, x ∈ dom(σ)}. A *renaming* σ is a bijective substitution. The application of substitutions is extended to literals, clauses, and sets/sequences of such objects in the usual way. The function mgu denotes the *most general unifier* of two terms, atoms, or literals, if it exists. We assume that any mgu of two terms or literals is idempotent and does not introduce fresh variables.
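A standard mgu computation (syntactic unification with occurs check) can be sketched as follows; here variables are plain strings and non-variable terms are tuples (f, arg1, …), so a constant a is the one-element tuple `("a",)`. Note that, unlike the idempotent mgus assumed above, this sketch returns the unifier in triangular form (a dict of bindings) rather than fully applied.

```python
def is_var(t):
    # In this encoding every bare string is a variable.
    return isinstance(t, str)

def walk(t, s):
    # Follow variable bindings in the substitution s.
    while is_var(t) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    # Occurs check: does variable v appear in term t under s?
    t = walk(t, s)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:])

def unify(a, b, s=None):
    """Return an mgu of terms a and b as a dict of bindings, or None."""
    s = {} if s is None else s
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return None if occurs(a, b, s) else {**s, a: b}
    if is_var(b):
        return None if occurs(b, a, s) else {**s, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) \
            and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None
```

For instance, unifying Q(x3, f(a)) with Q(b, x4), as needed to resolve the first two clauses of the example below, binds x3 to b and x4 to f(a); unifying x with f(x) fails by the occurs check.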

The resolution calculus consists of two inference rules: Resolution and Factoring [14, 15]. The rules operate on a state (N, S), where the initial state for a classical resolution refutation from a clause set N is (∅, N), and for an SOS refutation with clause set N and initial SOS S the initial state is (N, S). We describe the rules in the form of abstract rewrite rules operating on states (N, S). As usual, we assume for the resolution rule that the involved clauses are variable disjoint. This can always be achieved by applying renamings with fresh variables.

**Resolution** (N, S ⊎ {C ∨ K}) ⇒<sub>RES</sub> (N, S ∪ {C ∨ K, (D ∨ C)σ})
provided (D ∨ L) ∈ (N ∪ S) and σ = mgu(L, comp(K))

**Factoring** (N, S ⊎ {C ∨ L ∨ K}) ⇒<sub>RES</sub> (N, S ∪ {C ∨ L ∨ K} ∪ {(C ∨ L)σ})
provided σ = mgu(L, K)

The clause (D ∨ C)σ is called the result of a *Resolution inference* between its parents. The clause (C ∨ L)σ is called the result of a *Factoring inference* on its parent. A sequence of rule applications (N, S) ⇒<sup>∗</sup><sub>RES</sub> (N, S′) is called a *resolution derivation*. It is called an *SOS resolution derivation* if N ≠ ∅. In case ⊥ ∈ S′, it is called an *(SOS) resolution refutation*.

**Theorem 1 (Soundness and Refutational Completeness of (SOS) Resolution [14, 18]).** *Resolution is sound and refutationally complete [14]. If for some clause set* N *and initial SOS* S*,* N *is satisfiable and* N <sup>∪</sup> S *is unsatisfiable, then there is a derivation of* <sup>⊥</sup> *from* (N,S) *[18].*

Where a resolution derivation (N, S) ⇒<sup>∗</sup><sub>RES</sub> (N, S′) shows how new clauses can be derived from (N, S), a deduction presents the minimal derivation of a single clause, e.g., the empty clause ⊥ in the case of a refutation. For deductions we require every clause to be used exactly once, so deductions always have tree form. This is a purely technical restriction, see Corollary 5, that facilitates our deduction transformation technique, which then need not take care of variable renamings except for input clauses.

**Definition 2 (Deduction).** *A* deduction π<sub>N</sub> = [*C*<sub>1</sub>, …, *C*<sub>*n*</sub>] *of a clause C*<sub>*n*</sub> *from some clause set N is a finite sequence of clauses such that for each C*<sub>*i*</sub> *the following holds:*


*and for each C*<sub>*i*</sub> ∈ π<sub>N</sub>*, i < n:*


*We omit the subscript* <sup>N</sup> *in* <sup>π</sup><sup>N</sup> *if the context is clear.*

A deduction π′ of some clause C ∈ π, where π and π′ are deductions from N, is a *subdeduction* of π if π′ ⊆ π, where for the latter subset relation we identify sequences with multisets. A deduction π<sub>N</sub> = [*C*<sub>1</sub>, …, *C*<sub>*n*−1</sub>, ⊥] is called a *refutation*.

Note that variable renamings are only applied to clauses from N such that all clauses from N that are introduced in the deduction are variable disjoint.

**Definition 3 (SOS Deduction).** *A deduction* π<sub>N∪S</sub> = [*C*<sub>1</sub>, …, *C*<sub>*n*</sub>] *is called an* SOS deduction *if the derivation* (N, S<sub>0</sub>) ⇒<sup>∗</sup><sub>RES</sub> (N, S<sub>*m*</sub>) *is an SOS derivation, where C*′<sub>1</sub>, …, *C*′<sub>*m*</sub> *is the subsequence of* [*C*<sub>1</sub>, …, *C*<sub>*n*</sub>] *with input clauses removed,* S<sub>0</sub> = S*, and* S<sub>*i*+1</sub> = S<sub>*i*</sub> ∪ {*C*′<sub>*i*+1</sub>}*.*

**Definition 4 (Overall Substitution of a Deduction).** *Given a deduction* π *of a clause* C_n*, the* overall substitution τ_{π,i} *of* C_i ∈ π *is recursively defined by*

*1. if* C_i *is a factor of* C_j *with* j < i *and mgu* σ*, then* τ_{π,i} = τ_{π,j} ◦ σ*,*


*and the overall substitution of the deduction is* τ_π = τ_{π,n}*. We omit the subscript* π *if the context is clear.*

Overall substitutions are well-defined because clauses introduced from N into the deduction are variable disjoint and each clause is used exactly once in the deduction. A *grounding* of an overall substitution τ of some deduction π is a substitution τδ such that codom(τδ) only contains ground terms and dom(δ) is exactly the set of variables in codom(τ).
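To make the composition of most general unifiers into an overall substitution concrete, here is a small Python sketch with a naive term representation (variables as strings, applications as tuples); reading τ_{π,j} ◦ σ as "apply σ first, then τ_{π,j}" is our assumption, and `tau` below is the grounding substitution used for the example refutation later in this section:

```python
def apply(subst, term):
    """Apply a substitution (dict: variable -> term) to a term.
    Terms are variables (str) or (functor, arg, ...) tuples."""
    if isinstance(term, str):
        return subst.get(term, term)
    return (term[0],) + tuple(apply(subst, t) for t in term[1:])

def compose(tau, sigma):
    """tau ∘ sigma: apply sigma first, then tau (our reading of Definition 4)."""
    out = {x: apply(tau, t) for x, t in sigma.items()}
    for x, t in tau.items():
        out.setdefault(x, t)
    return out

def is_ground(term):
    """A term is ground if it contains no variables."""
    if isinstance(term, str):
        return False
    return all(is_ground(t) for t in term[1:])

# the overall grounding substitution of the running example in this section
tau = {"x1": ("b",), "x2": ("a",), "x3": ("b",), "x4": ("f", ("a",)),
       "x5": ("a",), "x6": ("a",)}
```

A grounding maps every variable in the codomain to a ground term, which `is_ground` lets us verify directly.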

**Corollary 5 (Deduction Refutations versus Resolution Refutations).** *There exists a resolution refutation* (N,S) ⇒*_RES (N,S′ ∪ {⊥}) *if and only if there exists a deduction refutation* π_{N∪S} = [C_1,...,C_{n−1}, ⊥] *where* C_i ∈ (N ∪ S′) *for all* i*, modulo variable renaming.*

We prove the generalized completeness result of SOS by transforming non-SOS refutations into SOS refutations. To illustrate our proof transformation technique, consider the unsatisfiable set of clauses N below. Literals in N are labeled by singleton sets of unique natural numbers [12]. We will refer to the literal labels during proof transformation in order to identify resolution and factorization steps. Labels are inherited in a resolution inference and united for the factorized literal in a factoring inference; see the factoring inference on clause (3) in Figure 1.

$$\begin{aligned} N = \{\; &(1)\colon \{1\}\neg Q(x_3, f(a)) \lor \{2\}P(f(a)), & &(2)\colon \{3\}\neg P(x_4) \lor \{4\}\neg Q(b, x_4), \\ &(5)\colon \{5\}\neg Q(b, a) \lor \{6\}Q(x_1, f(x_6)), & &(6)\colon \{7\}Q(b, x_2) \lor \{8\}R(x_2) \lor \{9\}T(c, x_1), \\ &(9)\colon \{10\}\neg R(x_5), & &(11)\colon \{11\}\neg T(c, b) \;\} \end{aligned}$$
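For experimentation, the labeled clause set N can be encoded directly; the representation below (plain sets for labels, literals kept as unparsed strings, so no unification) is our own simplification:

```python
# clause number -> list of labeled literals (label, sign, atom-as-string)
N = {
    1:  [({1}, "-", "Q(x3, f(a))"), ({2}, "+", "P(f(a))")],
    2:  [({3}, "-", "P(x4)"),       ({4}, "-", "Q(b, x4)")],
    5:  [({5}, "-", "Q(b, a)"),     ({6}, "+", "Q(x1, f(x6))")],
    6:  [({7}, "+", "Q(b, x2)"),    ({8}, "+", "R(x2)"), ({9}, "+", "T(c, x1)")],
    9:  [({10}, "-", "R(x5)")],
    11: [({11}, "-", "T(c, b)")],
}

# every input literal carries a unique singleton label, i.e. N is label-disjoint
labels = [frozenset(lab) for clause in N.values() for (lab, _, _) in clause]
```

The check below confirms the label-disjointness required of input clauses: eleven literals, eleven pairwise distinct singleton labels.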

Figure 1 shows a resolution refutation

π = [(5),(6),(7),(1),(2),(3),(4),(8),(9),(10),(11),(12)]

from N. This resolution refutation is also an SOS refutation with SOS S = {(2),(5)} and remaining clause set N \ S. It is not an SOS refutation with SOS S′ = {(5)} and remaining clause set N \ S′, because the resolution step between clauses (1) and (2) is not an SOS step. The shaded part of the tree belongs to an SOS deduction with S′ = {(5)}.

The transformation identifies a clause closest to the leaves of the tree, obtained by resolution, that has one parent that can be derived by the SOS strategy, while the other parent is neither in the SOS nor an input clause. For our example with starting SOS S = {(5)} this is clause (8). The parent (7) can be derived via SOS from S but the other parent (4) is not part of an SOS derivation. The overall grounding substitution of π is τ = {x_1 → b, x_2 → a, x_3 → b, x_4 → f(a), x_5 → a, x_6 → a}. Now the idea of a single transformation step is to perform the

**Fig. 1.** Refutation π of N

resolution step on the labeled literal {1,4}¬Q(b,f(a)) and the respective literal {6}Q(x_1,f(x_6)) of the SOS-derivable clause (7) already on the respective literals from the input clauses yielding (8), here clauses (1) and (2). To this end the derivation [(5),(6),(7)] is copied with fresh variables, see Figure 2, yielding the clauses (7′) and (7″) used in the refutation π′ below, see also Figure 3.

$$\frac{(5')\colon \{5\}\neg Q(b,a) \lor \{6\}Q(x_7, f(x_9)) \qquad (6')\colon \{7\}Q(b, x_8) \lor \{8\}R(x_8) \lor \{9\}T(c, x_7)}{(7')\colon \{6\}Q(x_7, f(x_9)) \lor \{8\}R(a) \lor \{9\}T(c, x_7)}$$

**Fig. 2.** The copied subdeductions deriving (7)

The two freshly renamed copies (7′) and (7″) are resolved with the respective input clauses (1) and (2). Finally, the rest of the deduction yielding clause (8) is simulated with the resolved input clauses, see Figure 3. The resulting clause is exactly clause (8) from the original deduction π, but now it is derived by an SOS deduction. The deduction can then be continued the same way it was done in π and in this case will already yield an SOS refutation.

$$\pi' = [(5),(6),(7),(5'),(6'),(7'),(1),(1'),(2),(2'),(8'),(8''),(8'''),(9),(10),(11),(12)]$$

The example motivates our use of literal labels. Firstly, they tell us which literals from input clauses need to be resolved: here the literals {1}¬Q(x_3,f(a)) and {4}¬Q(b,x_4) that are factorized in π to {1,4}¬Q(b,f(a)). Secondly, they guide additional factoring steps in π′ during the simulation of the non-SOS part of π: here the factoring between the two literals labeled {8} in clause (8′) and

**Fig. 3.** The new SOS deduction yielding a copy of clause (8)

the two literals with label {9} in clause (8″). The transformation always works because the overall grounding substitution of the initial refutation π is preserved by the transformation. It just needs to be extended to the extra variables added by freshly renamed copies of clauses.

The above example shows the importance of keeping track of the occurrences of literals in a deduction. A *labeled literal* is a pair ML where M is a finite non-empty set of natural numbers called the *label* and L is a literal. We identify literals with labeled literals and refer explicitly to the label of a labeled literal by the function lb. The function lb is extended to clauses via union of the respective literal labels. We extend the notion of a clause to that of a labeled clause built on labeled literals in the straightforward way. We call a deduction π_N *label-disjoint* if the clauses from N in the deduction have unique singleton labels. Labels are inherited in a deduction as follows: in the case of a resolution inference, the labels of the parent clauses are inherited, and in the case of a factoring inference, the label of the remaining literal is the union of the labels of the factorized literals.
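The label bookkeeping itself is purely syntactic and easy to mechanize. The following Python sketch implements the two inheritance rules; literal contents are opaque strings here and application of the mgu is deliberately omitted, so this illustrates only the label handling:

```python
def resolve_labels(parent1, parent2, i, j):
    """Resolution: drop the resolved pair (literal i of parent1, literal j
    of parent2); all remaining literals inherit their labels unchanged."""
    rest1 = [lit for k, lit in enumerate(parent1) if k != i]
    rest2 = [lit for k, lit in enumerate(parent2) if k != j]
    return rest1 + rest2

def factor_labels(clause, i, j):
    """Factoring: merge literal j into literal i; the remaining literal
    carries the union of both labels."""
    lab_i, lit_i = clause[i]
    lab_j, _ = clause[j]
    return [(lab_i | lab_j, lit_i) if k == i else lit
            for k, lit in enumerate(clause) if k != j]

# the factoring from the running example: labels {1} and {4} unite to {1, 4}
c3 = [(frozenset({1}), "~Q(x3, f(a))"), (frozenset({4}), "~Q(b, f(a))")]
c4 = factor_labels(c3, 1, 0)
```

On the example clause, factoring leaves a single literal labeled {1, 4}, exactly the label referred to during the transformation above.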

In general, we need to identify the parts of a deduction that are already contained in an SOS deduction; this is called the *partial SOS* of a deduction, Definition 6. This information can then be used to perform the above transformation on any deduction π.

**Definition 6 (PSOS of a Deduction).** *Let* π *be a deduction from* N ∪ S*; then the* partial SOS *(PSOS)* O* *of* π, N, S *is defined as* O* = ⋃_{i=0}^{m} O_i*, where* O_0 = S*,* O_{i+1} = O_i ∪ {C_j} *provided* C_j ∈ π*,* C_j ∉ O_i*, and* C_j *is either the factor of some clause in* O_i *or the resolvent of two clauses in* π *where at least one parent is from* O_i *and the other parent is from* N ∪ O_i*, and where* O_m *is such that there is no longer such a* C_j *in* π*.*

The partial SOS is well-defined because the resulting O* is independent of the sequence O_i used. For example, for the deduction π from N presented in Figure 1, the set O* = {(5),(6),(7)} is the PSOS of π, N, {(5)}. Next we present a criterion for when the PSOS of a deduction actually signals an SOS deduction.
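Definition 6 reads naturally as a fixpoint computation, sketched below in Python. The step encoding is ours; note one simplification: input clauses that merely feed an SOS step (such as (6) in the example) are listed in the text's O* but are not collected by this sketch, which tracks S plus derived SOS clauses only:

```python
def psos(steps, n_input, sos):
    """steps: derived clause -> ("factor", (parent,)) or ("resolve", (p1, p2)).
    Returns sos plus every derived clause reachable by an SOS inference:
    a factor of an O-clause, or a resolvent of an O-clause with a clause
    from N ∪ O (our reading of the SOS side condition)."""
    O = set(sos)
    changed = True
    while changed:
        changed = False
        for clause, (kind, parents) in steps.items():
            if clause in O:
                continue
            if kind == "factor" and parents[0] in O:
                O.add(clause)
                changed = True
            elif kind == "resolve":
                a, b = parents
                if (a in O and b in n_input | O) or (b in O and a in n_input | O):
                    O.add(clause)
                    changed = True
    return O

# the refutation of Figure 1: derived clauses and their parents (our reading)
steps = {3: ("resolve", (1, 2)), 4: ("factor", (3,)), 7: ("resolve", (5, 6)),
         8: ("resolve", (4, 7)), 10: ("resolve", (8, 9)), 12: ("resolve", (10, 11))}
n_input = {1, 2, 5, 6, 9, 11}
```

With S = {(5)} only (7) is SOS-derivable, matching the shaded subtree of Figure 1; with S = {(2),(5)} every inferred clause is covered, matching the SOS refutation claimed in the text.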

**Lemma 7 (SOS Deduction).** *Let* O* *be the PSOS of* π, N, S*. Then* π *is an* SOS deduction *if* O* \ S = π \ (N ∪ S)<sup>4</sup>*, i.e., all inferred clauses in* π *are contained in* O**.*

*Proof.* Let π_{N∪S} = [C_1,...,C_n] and let [C′_1,...,C′_m] be the subsequence of π_{N∪S} with input clauses removed. Let O* be the PSOS of π, N, S. Then [C′_1,...,C′_m] = O* \ S = π \ (N ∪ S) by assumption. We show that (N,S_0) ⇒*_RES (N,S_m) is an SOS derivation, following Definition 3, by induction on m. If m = 0 then π only consists of input clauses and there is nothing to show. For the case m = 1, the clause C′_1 is the result of a factoring inference from S or the result of a resolution inference from N ∪ S such that at least one parent is in S, for otherwise C′_1 ∉ (O* \ S). So (N,S_0) ⇒*_RES (N,S_0 ∪ {C′_1}) is an SOS derivation. For the induction step, assume the property holds for i. If C′_{i+1} is the result of a factoring inference, then its parent C is contained in S_i, because otherwise C ∈ N since π is a deduction, and therefore C′_{i+1} ∉ (O* \ S), a contradiction. If C′_{i+1} is the result of a resolution inference, then again all its parents are contained in N ∪ S_i because π is a deduction. If both parents are from N, then C′_{i+1} ∉ (O* \ S), a contradiction. So, by the induction hypothesis, (N,S_0) ⇒*_RES (N,S_i) ⇒_RES (N,S_{i+1}) is an SOS derivation.

The rest of this section is devoted to describing the transformation in detail. In the next section, we then prove the new completeness result for SOS.

Let π be a label-disjoint deduction from N ∪ S and let C_k ∈ π be a clause of minimal index such that C_k is the result of a resolution inference from clauses C_j ∈ O* and C_i ∉ (N ∪ O*). Let τ be an overall ground substitution for π. We transform π into π′ by changing the deduction of C_i such that the overall deduction gets "closer" to an SOS derivation and preserves τ. Let

$$\begin{aligned} C_j &= C'_j \lor L \\ C_i &= C'_i \lor K \\ C_k &= (C'_i \lor C'_j)\sigma \end{aligned} \tag{1}$$

where σ = mgu(K, comp(L)). Without loss of generality we assume that

$$\pi = [C\_1, \dots, C\_i, C\_{i+1}, \dots, C\_j, C\_k, C\_{k+1}, \dots, C\_n] \tag{2}$$

where [C_1,...,C_i] and [C_{i+1},...,C_j] are subdeductions of π, and the prefixes of these sequences are exactly the introduced renamed copies of input clauses from N that are used to derive C_i and C_j, respectively. The transformed derivation will be

$$\pi' = [C^1_{i+1}, \dots, C^1_j, \dots, C^m_{i+1}, \dots, C^m_j, D_1, \dots, D_l, C'_{k+1}, \dots, C'_n] \tag{3}$$

where

<sup>4</sup> Here we refer to the removal of all input clauses from O<sup>∗</sup> and π, respectively.

- (i) if C_p is an input clause not containing a literal K′ with lb(K′) ⊆ lb(K), then D_{q+1} = C_p and we associate D_{q+1} with C_p;
- (ii) if C_p is an input clause containing a literal K′ with lb(K′) ⊆ lb(K), then D_{q+1} = C_p and D_{q+2} is the resolvent between D_{q+1} and a so far unused clause C^o_j on the literals K′ ∈ D_{q+1} and L′ ∈ C^o_j where lb(K′) ⊆ lb(K) and lb(L′) = lb(L), and we associate D_{q+2} with C_p;
- (iii) if C_p is the resolvent between two clauses C_{p′}, C_{p″}, then we perform the respective resolution step between the associated clauses and respective associated literals from D_{q′}, D_{q″} yielding D_{q+1} and associate D_{q+1} with C_p;
- (iv) if C_p is the factor on some literal K′ with lb(K′) ⊆ lb(K), then we perform the respective factoring steps D_{q+1},...,D_{q+s} for respective literals with labels from C′_j, where s = |C′_j|, and we associate D_{q+s} with C_p;
- (v) if C_p is the factor on some literal K′ with lb(K′) ⊈ lb(K), then we perform the respective factoring step on the respective literals with identical labels from clause D_q yielding D_{q+1} and we associate D_{q+1} with C_p.

Note that by assumption, the generation of clauses C_{k+1},...,C_n does not depend on the clauses C_1,...,C_i,C_{i+1},...,C_j but only on C_k and the input clauses. We will prove that C_kτ = C_kτ′ = D_lτ′, which is then sufficient to prove C_nτ = C_nτ′ = C′_nτ′ and for the above to be well-defined. In general, the clause D_l is not identical to C_k because we introduce fresh variables in π′ and do not make any specific assumptions on the unifiers used to derive D_l.

Mapping the transformation to our running example, Figure 1: C_j = (7), C_i = (4), and C_k = (8). We need two copies of (7) because K = {1,4}¬Q(b, f(a)), so m = |{1,4}| = 2, and L = {6}Q(x_1, f(x_6)).

# **3 A Generalized Completeness Proof for SOS**

In this section, we prove that repeated applications of the transformation introduced in the previous section can actually transform an arbitrary deduction into an SOS deduction, given that at least one clause from the SOS occurs in the original deduction. Firstly, we show that associated clauses of the transformed deduction preserve the main properties of the original deduction. The extended substitution is identical to the original substitution on old clauses, and the changed part of the deduction ends in exactly the same clause.

**Lemma 8 (Properties of Associated Clauses).** *Let* C_j*,* C_i*,* C_k*,* L*,* K*,* π*,* π′*,* τ*,* τ′ *be as defined in (1), (2), and (3), page 334. For each clause* C *out of* [C_1,...,C_i] *and clause* D *associated with* C*:*


*Proof.* 1. By definition of τ′ the additional variables in τ′ do not occur in C, while τ′ is identical to τ on the variables of C; hence Cτ = Cτ′.

2. By induction on the generation of π′. For the base case, every literal occurring in N ∪ S has a unique label and any renamed clause C^o_m for some C_m ∈ (N ∪ S) keeps its labels. So, for any two literals K′ and L′ in any non-inferred clauses in π and π′, K′τ′ = L′τ′ when the labels are equal. For the induction step, for inferred clauses, lb(K′) = lb(L′) happens when the label of K′ is inherited from L′ through an inference. The inference uses an mgu which is compatible with τ′ due to τ′ being an overall ground substitution, so K′τ′ = L′τ′.

3. We prove this property by induction on the length of the derivation [C_1,...,C_i]. Let C = C_p, 1 ≤ p ≤ i, and let D_1,...,D_q be the clauses generated until C_{p−1}, for which, by the induction hypothesis, the property already holds.


lb(L′_{p′}) = lb(L′_{q′}) and lb(L′_{p″}) = lb(L′_{q″}), and none of these literals has a label from lb(K) or lb(C^o_j). Hence, the conjecture holds by the induction hypothesis.

(iv) If C results from a factoring on K′ from C_{p−1}, we get D_{q+s} by a sequence of s factoring inferences from D_{q+1} associated with C_{p−1}. Any factorings on C_{p−1} and D_{q+1} do not change literal labels because we factorize literals of identical label. So, this property holds by the induction hypothesis. This holds regardless of whether lb(K′) ⊆ lb(K).

4. From Lemma 8.3 we know that lb(C) \ lb(K) = lb(D) \ lb(C^o_j), and lb(C^o_j) ⊆ lb(D) if there is K′ ∈ C with lb(K′) ⊆ lb(K). Since the labels coincide, using Lemma 8.2 we have Cτ \ {K′ ∈ Cτ | lb(K′) ⊆ lb(K)} = Dτ′ \ {L′ ∈ Dτ′ | lb(L′) ⊆ lb(C^o_j)}, and C^o_jτ′ ⊆ Dτ′ if there is K′ ∈ C with lb(K′) ⊆ lb(K). This hypothesis holds by applying Lemma 8.1 on literals and clauses from π in the equation.

5. The clause C_k is the result of a resolution inference between C_i and C_j upon K and L: C_kτ = C′_iτ ∪ C′_jτ. By the translation and because {K′ ∈ C_i | lb(K′) ⊆ lb(K)} = {K}, the clause C_i is associated with D_l ∈ π′ and C_iτ \ {Kτ} = D_lτ′ \ {L′ ∈ D_lτ′ | lb(L′) ⊆ lb(C^o_j)}. Since C^o_jτ′ = C′_jτ = C_jτ \ {Lτ}, we have {L′ ∈ D_lτ′ | lb(L) ⊆ lb(L′) for some L ∈ C^o_j} = D_lτ′ ∩ (C_jτ \ {Lτ}) = C_jτ \ {Lτ}. So C_iτ \ {Kτ} = D_lτ′ \ (D_lτ′ ∩ (C_jτ \ {Lτ})) = D_lτ′ \ (C_jτ \ {Lτ}). We can add C_jτ \ {Lτ} to both sides and get C_kτ = C_iτ ∪ C_jτ \ {Kτ, Lτ} ⊇ D_lτ′. In addition, since lb(K′) ⊆ lb(K), this means C′_jτ = C^o_jτ′ ⊆ D_qτ′. Therefore C_kτ = C_iτ ∪ C_jτ \ {Kτ, Lτ} = D_lτ′.

Next we need a well-founded measure that decreases with every transformation step and signals an SOS deduction when it reaches its minimum. Given a clause set N and an initial SOS S, the SOS measure of a deduction π is μ(π) = Σ_{C_i ∈ π} μ(C_i, π), where μ(C_i, π) = 0 if C_i ∈ N ∪ O* and μ(C_i, π) = 1 otherwise.
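Given a PSOS, the measure is a one-liner; clause identifiers are abstract here, and the example values follow our reading of Figure 1:

```python
def mu(pi, n_input, o_star):
    """Number of clauses of the deduction pi that are neither input
    clauses nor in the partial SOS O*."""
    return sum(0 if c in n_input or c in o_star else 1 for c in pi)

# the refutation of Figure 1 with S = {(5)}: O* covers (5), (6), (7) only
pi = [5, 6, 7, 1, 2, 3, 4, 8, 9, 10, 11, 12]
n_input = {1, 2, 5, 6, 9, 11}
```

With O* = {(5),(6),(7)} the five inferred clauses (3), (4), (8), (10), (12) each contribute 1; once O* covers every inferred clause, the measure drops to 0, which is exactly the situation of Lemma 9.2.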

**Lemma 9 (Properties of** μ**).** *Given a clause set* <sup>N</sup>*, an initial SOS* <sup>S</sup>*, and a deduction* π *that contains at least one resolution step,*

*1.* μ(π) ≥ 0*, and*
*2. if* μ(π) = 0 *then* π *is an SOS deduction.*

*Proof.* 1. Obvious.

2. Towards a contradiction, suppose μ(π) = 0 but π = [C_1,...,C_n] is not an SOS deduction. This means O* \ S ⊊ π \ (N ∪ S) by Lemma 7. Consider a clause C_i ∈ (π \ (N ∪ S)) \ (O* \ S) of minimal index. Then C_i must be the result of an inference on some C_j and C_k such that both are not in O*. This means C_i ∉ (N ∪ O*). For this clause, μ assigns a nonzero value: μ(C_i, π) > 0. Therefore, μ(π) ≠ 0, a contradiction.

Next we combine the properties of associated clauses for one transformation step with the properties of the measure, resulting in an overall deduction transformation that can be recursively applied and deduces the same clause modulo some grounding.

**Lemma 10 (Properties of the Transformation).** *Given a deduction* π *of a clause* C_n *from* N ∪ S *that contains at least one resolution step such that* π ∩ S ≠ ∅*, an overall ground substitution* τ *of* π*, and the transformed deduction* π′ *of a clause* C′_n *as defined in (1), (2), and (3) with overall ground substitution* τ′*, we have:*


*Proof.* 1. We show that π′ is a deduction following Definition 2. These properties carry over from π. Observe that, if π_1 is a deduction of C_k from N ∪ S and π_2 is a deduction from N ∪ S ∪ {C_k} using C_k only once, their concatenation π_1 ◦ π_2 is a deduction from N ∪ S. Firstly, the subsequences [C^o_{i+1},...,C^o_j] are deductions of C^o_j from N ∪ S since they are only renamed copies of the subdeduction [C_{i+1},...,C_j] of π. Secondly, the subsequence [C_k,...,C_n] is a deduction of C_n from N ∪ S ∪ {C_k} since the clauses after C_k do not use any clauses before C_k by the way π is represented as a sequence. Now, by showing that [C^1_j,...,C^m_j, D_1,...,D_l, C_k] is a deduction of C_k from N ∪ S ∪ {C^o_j}_{o∈[1,m]}, the sequence [D_1,...,D_l] then connects the initial copied sequences and the trailing subsequence.
Each C^o_j is used for exactly one resolution inference producing some D_q, the other required clauses are copied, and the later resolution and factoring steps in [D_1,...,D_l] are sound while the deduction properties of [C_1,...,C_i] are preserved in its associated clauses: for an inference where C_{p′} (and C_{p″}) generates C_p, we have a unique inference between their associated clauses D_{q′} (and D_{q″}) yielding D_{q+1}, possibly with additional factoring inferences in between. If C_p is an input clause not containing a literal K′ with lb(K′) ⊆ lb(K), then D_{q+1} = C_p ∈ N. The clause D_{q+1} is used in π′ as C_p is used in π; if C_p is an input clause containing a literal K′ with lb(K′) ⊆ lb(K), the resolution between D_{q+1} and a so far unused clause C^o_j is sound as K′ and comp(L′) are unifiable by τ′.
Here, all C^o_j will eventually be used as there are m = |lb(K)| such literals in the clauses from N; if C_p is the resolvent between two clauses C_{p′}, C_{p″}, then the respective resolution step between the associated clauses D_{q′}, D_{q″} upon the respective associated literals K′ and L′ is sound because we get K′τ′ = comp(L′)τ′ using Lemma 8; if C_p is the factor on some literal K′ with lb(K′) ⊆ lb(K), then the respective factoring steps D_{q+1},...,D_{q+s} are also sound: each pair of the s associated literals M and M′ from C^o_j and C^{o′}_j is unifiable because Mτ′ = M′τ′; if C_p is the factor of C_{p−1} upon some literals K′ and L′ with {lb(K′), lb(L′)} ⊈ lb(K), the respective factoring step on the associated clause D_q is also sound by Lemma 8. Therefore π′ is a deduction from N ∪ S.

2. By Lemma 8.5, C_kτ = D_lτ′. The derivation of the clauses C_k, C_{k+1},...,C_n only depends on C_k and the input clauses by assumption. By an inductive argument we get C_{k+1}τ = C′_{k+1}τ′, yielding C_nτ = C′_nτ′.

3. The clauses in [C^o_{i+1},...,C^o_j] have measure 0, as do their originals in [C_{i+1},...,C_j], because they are in N ∪ O*. The clauses in [C_k,...,C_n] also retain their original measures. The clauses in [D_1,...,D_l] are such that Σ^l_{k=1} μ(D_k, π′) < Σ^i_{k=1} μ(C_k, π). More specifically, any C ∈ [C_1,...,C_i] that is not in N ∪ O* (with measure μ(C, π) ≥ 1) and contains a K′ with lb(K′) ⊆ lb(K) is associated with a D_q ∈ O* \ N having measure μ(D_q, π′) = 0, while all other clauses in [D_1,...,D_l] are either copied from π with the same measure as before or new in π′ but with measure 0.

By induction on the length of the sequence [C_1,...,C_i] we prove the following property: if D is associated with a clause C ∈ [C_1,...,C_i] and C contains some literal in {K′ | lb(K′) ⊆ lb(K)}, then D ∈ N ∪ O* and μ(D, π′) = 0. Let C = C_p, and let D_1,...,D_q be the clauses generated until C_{p−1}, for which the property already holds.


Finally, by the choice of C_i, C_j, and C_k, there must exist at least one C_p with some literal from {K′ | lb(K′) ⊆ lb(K)} but associated with some D such that D ∈ O*, by case (iii) or (iv) above. This also means μ(D, π′) = 0. The clause C_i has this property as it contains K. In addition, any such C_p has a nonzero measure because C_i ∉ N ∪ O* and C_p is used to derive C_i. Therefore, we have μ(C_p, π) > μ(D, π′) = 0. As these clauses are never copied to π′, μ(π′) < μ(π).

Eventually, by an inductive argument we prove our main result.

**Theorem 11 (Generalized SOS Completeness).** *There is an SOS resolution refutation from* (N,S) *if and only if there is a resolution refutation from* N ∪ S *that contains at least one clause from* S*.*

*Proof.* "⇒": Obvious: If there is no refutation from <sup>N</sup> <sup>∪</sup> <sup>S</sup> using a clause <sup>S</sup> then there can also not be any SOS resolution refutation from (N,S).

"⇐": If there is a deduction refutation π from N <sup>∪</sup> S that contains at least one clause from <sup>S</sup>, then by an inductive argument on <sup>μ</sup> it can be transformed into an SOS deduction refutation with SOS S, and the result follows by Corollary 5. If μ(π) = 0 then π is already an SOS deduction, Lemma 9. For otherwise, we transform the deduction π into a deduction π according to (1), (2), and (3). A refutation always contains at least one resolution step, so by Lemma 10, π is also a refutation from N <sup>∪</sup> S and μ(π ) < μ(π). Eventually, π can be transformed into a label-disjoint deduction by assigning fresh labels to all used clauses from N <sup>∪</sup> S.

As an example for the "⇒" direction, consider the propositional clause set N = {P, ¬P} and SOS S = {Q}. Obviously, there is no refutation of N ∪ S using Q, and there is no SOS refutation from (N,S). Theorem 11 also guarantees that the consecutive application of the proof transformation steps (1), (2), and (3), page 334, results in an effective recursive procedure that transforms non-SOS refutations into SOS refutations.
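The proof of Theorem 11 is in effect a terminating program: apply one transformation step while the measure is positive. A schematic driver, with the step itself passed in as a black box and only the strict decrease from Lemma 10 used, could look as follows (the toy instantiation below is ours):

```python
def to_sos(pi, transform, measure):
    """Repeat the (1)-(3) transformation step until the measure reaches 0;
    by Lemma 10 each step strictly decreases it, so this terminates."""
    while measure(pi) > 0:
        new_pi = transform(pi)
        assert measure(new_pi) < measure(pi), "Lemma 10: measure must decrease"
        pi = new_pi
    return pi

def fix_first(p):
    """Toy stand-in for one transformation step: repair the first
    clause not yet covered by the SOS."""
    i = p.index("non-sos")
    return p[:i] + ["sos"] + p[i + 1:]

def count_non_sos(p):
    """Toy stand-in for the SOS measure mu."""
    return sum(1 for tag in p if tag == "non-sos")
```

The driver encodes the inductive argument of the proof: well-foundedness of μ over the naturals is what makes the recursion an effective procedure.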

# **4 A New Notion of Relevance**

The idea of our notion of relevance is to separate clauses that are ultimately needed in a refutation proof, called *relevant*, from clauses that are merely useful, called *semi-relevant*, and from clauses that are not needed at all, called *irrelevant*.

**Definition 12 (Relevance).** *Given an unsatisfiable set of clauses* <sup>N</sup>*, a clause* C <sup>∈</sup> N *is* relevant *if for all deduction refutations* π *of* N *it holds that* C <sup>∈</sup> π*. A clause* C <sup>∈</sup> N *is* semi-relevant *if there exists a deduction refutation* π *of* N *in which* C <sup>∈</sup> π*. A clause* C <sup>∈</sup> N *is* irrelevant *if there is no deduction refutation* π *of* N *in which* C <sup>∈</sup> π*.*

With respect to our example clause set N from Section 2 and its refutation, Figure 1, clause (5) is semi-relevant but not relevant, because the clauses (1), (2), (6), (9), (11) are already unsatisfiable. The clauses (1), (2), (6), (9), (11) are all relevant.

**Lemma 13 (Relevance).** *Given an unsatisfiable set of clauses* N*, the clause* C <sup>∈</sup> N *is relevant if and only if* N \ {C} *is satisfiable.*

*Proof.* If N \ {C} is satisfiable, then by soundness there is no resolution refutation from N \ {C}, and since N is unsatisfiable, C must occur in all refutations of N. Conversely, if C occurs in all refutations, there is no refutation from N \ {C}, so by completeness of resolution N \ {C} is satisfiable.
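For a finite propositional clause set, Lemma 13 thus gives a direct relevance test: check the satisfiability of N \ {C}. A brute-force sketch (the signed-integer clause encoding and the enumeration-based `satisfiable` are our own illustrative devices, not the paper's):

```python
from itertools import product

def satisfiable(clauses):
    """Brute-force satisfiability test; a clause is a frozenset of nonzero
    integers, where -v denotes the negation of variable v."""
    variables = sorted({abs(l) for c in clauses for l in c})
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False

def relevant(n, c):
    """Lemma 13: in an unsatisfiable N, clause C is relevant iff N \\ {C} is satisfiable."""
    return satisfiable(n - {c})

# N = {P, ¬P, Q}: P and ¬P are relevant, Q is not.
P, NP, Q = frozenset({1}), frozenset({-1}), frozenset({2})
N = {P, NP, Q}
print(relevant(N, P), relevant(N, Q))  # True False
```

In first-order logic the satisfiability check is only semi-decidable, which is where Corollary 15 below comes from.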

**Lemma 14 (Semi-Relevance Test).** *Given a set of clauses* N *and a clause* C ∈ N*,* C *is semi-relevant if and only if* (N \ {C}, {C}) ⇒∗RES (N \ {C}, S ∪ {⊥})*.*

*Proof.* If (N \ {C}, {C}) ⇒∗RES (N \ {C}, S ∪ {⊥}), then we have found a refutation containing C. On the other hand, by Theorem 11, Lemma 7, and Corollary 5, if there is a refutation containing C, then there is also an SOS refutation with SOS {C}.
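At the propositional level, the test of Lemma 14 can be sketched as an SOS saturation loop in which every resolvent must have a parent in the SOS (the signed-integer clause encoding and the `limit` cutoff are our own simplifications; termination is only guaranteed where resolution decides the fragment):

```python
def resolvents(c, d):
    """All propositional resolvents of clauses c, d (frozensets of int literals)."""
    return {(c - {l}) | (d - {-l}) for l in c if -l in d}

def semi_relevant(n, c, limit=1000):
    """Lemma 14 sketch: saturate (N \\ {C}, {C}) under SOS resolution;
    C is semi-relevant iff the empty clause appears in the SOS."""
    rest, sos = n - {c}, {c}
    for _ in range(limit):
        new = set()
        for s in sos:                      # every inference uses an SOS parent
            for d in rest | sos:
                new |= resolvents(s, d)
        if frozenset() in new:             # empty clause derived
            return True
        if new <= sos:                     # saturated without the empty clause
            return False
        sos |= new
    raise RuntimeError("iteration limit reached")

# N = {P, ¬P, Q}: P is semi-relevant, Q is not.
P, NP, Q = frozenset({1}), frozenset({-1}), frozenset({2})
N = {P, NP, Q}
print(semi_relevant(N, P), semi_relevant(N, Q))  # True False
```

Since `resolvents` is symmetric in its arguments, letting only the first parent range over the SOS loses no inferences.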

An immediate consequence of the above test and completeness of resolution for first-order logic is the following corollary.

**Corollary 15 (Complexity of the Semi-Relevance Test).** *Testing semi-relevance in first-order logic is semi-decidable. It is decidable for all fragments where resolution constitutes a decision procedure.*

Fragments where our semi-relevance test is guaranteed to terminate include, for example, first-order fragments enjoying the bounded model property, such as the Bernays-Schoenfinkel fragment [3].

# **5 Conclusion**

We have extended and sharpened the original completeness result for SOS resolution [18], Theorem 11. The generalized SOS completeness result can actually be used to effectively test clauses for semi-relevance in case resolution constitutes a decision procedure for the respective clause set. This is, for example, the case for all fragments enjoying the bounded model property, such as the Bernays-Schoenfinkel fragment [3]. In general, our approach yields a semi-decision procedure for semi-relevance.

Our proof is based on deductions having an a priori tree structure. However, this is not a restriction in principle; it merely simplifies the transformation introduced in Section 2: renamings only have to be considered on input clauses. In a setting where proofs form directed acyclic graphs, renamings have to be carried all over a deduction, adding further technicalities to our transformation.

It is well known that changing the order of resolution steps in a resolution deduction may exponentially increase or decrease the length of the deduction. Therefore, our transformation of a deduction into an SOS deduction may also yield an exponential growth in the length of the deduction. It may also be the other way round if, e.g., subsumption is added to the transformation. It is also not difficult to find examples where the transformation of Section 2 introduces redundant clauses. Recall that we have not made any assumption with respect to redundancy on deductions. So an open question is whether corresponding results hold on non-redundant deductions and what they actually mean for a respective notion of relevance.

An open problem is whether a test for semi-relevance can be established with more restricted resolution calculi such as ordered resolution. In general, the SOS strategy is not complete with ordered resolution. However, it is complete with respect to a clause set saturated by ordered resolution. The technical obstacle here is that a saturated clause set may already contain the empty clause, because for our generalized completeness result and the respective relationship to semi-relevance, the set N may still be unsatisfiable without the clause C to be tested for semi-relevance.

**Acknowledgments:** This work was funded by DFG grant 389792660 as part of TRR 248. We thank our reviewers for their valuable comments.

# **References**


Conference, ACSC '95, Pathumthani, Thailand, December 11-13, 1995, Proceedings. Lecture Notes in Computer Science, vol. 1023, pp. 269–285. Springer (1995)


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# A Unifying Splitting Framework

Gabriel Ebner<sup>1</sup>, Jasmin Blanchette<sup>1,2,3</sup>, and Sophie Tourret<sup>2,3</sup>

<sup>1</sup> Vrije Universiteit Amsterdam, Amsterdam, the Netherlands {g.e.ebner,j.c.blanchette}@vu.nl

<sup>2</sup> Université de Lorraine, CNRS, Inria, LORIA, Nancy, France {jasmin.blanchette,sophie.tourret}@inria.fr

<sup>3</sup> Max-Planck-Institut für Informatik, Saarland Informatics Campus, Saarbrücken, Germany

{jasmin.blanchette,stourret}@mpi-inf.mpg.de

Abstract. AVATAR is an elegant and effective way to split clauses in a saturation prover using a SAT solver. But is it refutationally complete? And how does it relate to other splitting architectures? To answer these questions, we present a unifying framework that extends a saturation calculus (e.g., superposition) with splitting and embeds the result in a prover guided by a SAT solver. The framework also allows us to study locking, a subsumption-like mechanism based on the current propositional model. Various architectures are instances of the framework, including AVATAR, labeled splitting, and SMT with quantifiers.

## 1 Introduction

One of the great strengths of saturation calculi such as superposition [1] is that they avoid case distinctions. Derived clauses hold unconditionally, and the prover can stop as soon as it derives the empty clause, without having to backtrack. The drawback is that these calculi often generate long, unwieldy clauses that slow down the prover. A remedy is to partition the search space by splitting a multiple-literal clause C<sub>1</sub> ∨ ··· ∨ C<sub>n</sub> into variable-disjoint subclauses C<sub>i</sub>. Splitting approaches include splitting with backtracking [24], splitting without backtracking [20], labeled splitting [10], and AVATAR [22].

The SAT-based AVATAR architecture is of particular interest because it is so successful. Voronkov reported that an AVATAR-enabled Vampire could solve 421 TPTP [21] problems that had never been solved before by any system [22, Sect. 9], a mind-boggling number. AVATAR works well in combination with the superposition calculus because it combines superposition's strong equality reasoning with the SAT solver's strong clausal reasoning. It is also appealing theoretically, because it gracefully generalizes traditional saturation provers and yet degenerates to a SAT solver if the problem is propositional.

Example 1. To illustrate the approach, we follow the key steps of an AVATAR-enabled resolution prover on the initial clause set containing ¬p(a), ¬q(z, z), and p(x) ∨ q(y, b). The disjunction can be split into p(x) ← {[p(x)]} and q(y, b) ← {[q(y, b)]}, where C ← {[C]} indicates that the clause C is enabled only in models in which the associated propositional variable [C] is true. A SAT solver is then

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 344–360, 2021. https://doi.org/10.1007/978-3-030-79876-5\_20

run to choose a model J of [p(x)] ∨ [q(y, b)]. Suppose J makes [p(x)] true and [q(y, b)] false. Then resolving p(x) ← {[p(x)]} with ¬p(a) produces ⊥ ← {[p(x)]}, which closes the branch. Next, the SAT solver makes the right disjunct true, and resolving q(y, b) ← {[q(y, b)]} with ¬q(z, z) yields ⊥ ← {[q(y, b)]}. The SAT solver then reports "unsatisfiable," concluding the refutation.
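The interplay of Example 1 can be sketched at the propositional level. The first-order closing steps are hardcoded in a `closers` table, and `sat_model` is a naive enumeration stand-in for the SAT solver (both are our illustrative assumptions, not AVATAR's implementation):

```python
from itertools import product

def sat_model(clauses, variables):
    """Return a model (set of true variables) of the clauses, or None.
    A clause is a list of (variable, polarity) pairs."""
    for bits in product([False, True], repeat=len(variables)):
        m = {v for v, b in zip(variables, bits) if b}
        if all(any((v in m) == pos for v, pos in c) for c in clauses):
            return m
    return None

# SAT side of Example 1: [p] and [q] are the assertions for p(x) and q(y, b).
variables = ["p", "q"]
clauses = [[("p", True), ("q", True)]]  # from splitting p(x) ∨ q(y, b)
closers = {"p": [("p", False)],         # ⊥ ← {[p]} from resolving with ¬p(a)
           "q": [("q", False)]}         # ⊥ ← {[q]} from resolving with ¬q(z, z)
while (model := sat_model(clauses, variables)) is not None:
    closable = [v for v in model if closers.get(v)]
    if not closable:
        print("consistent")
        break
    clauses.append(closers[closable[0]])  # close the current branch
else:
    print("unsatisfiable")
```

Each round, the SAT solver proposes a branch and the (here hardcoded) first-order reasoning contributes a propositional clause that closes it, until no model remains.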

What about refutational completeness? Far from being a purely theoretical concern, establishing completeness—or finding counterexamples—could yield insights and perhaps lead to an even stronger AVATAR. Before we can answer this open question, we must mathematize splitting. Our starting point is the *saturation framework* by Waldmann, Tourret, Robillard, and Blanchette [23], based on Bachmair and Ganzinger [2]. It covers a wide array of techniques, but "the main missing piece of the framework is a generic treatment of clause splitting" [23, p. 332]. We provide that missing piece, in the form of a *splitting framework*, and use it to show the completeness of an AVATAR-like architecture.

Our framework has five layers, linked by refinement. The first layer consists of a refutationally complete *base calculus*, such as resolution or superposition. It must be presentable as an inference system and a redundancy criterion.

From a base calculus, we derive a *splitting calculus* (Sect. 3). This extends the base calculus with splitting and inherits the base's completeness. It works on A-clauses or A-formulas C ← A, where A is a set of propositional literals.

Using the saturation framework, we can prove the dynamic completeness of an abstract prover, formulated as a transition system, that implements the splitting calculus. However, this ignores a vital component of AVATAR: the SAT solver. AVATAR considers only inferences involving A-formulas whose assertions are true in the current propositional model. The role of the third layer is to reflect this behavior. A *model-guided prover* operates on states of the form (J, N ), where J is a propositional model and N is a set of A-formulas (Sect. 4).

The fourth layer introduces AVATAR's *locking* mechanism (Sect. 5). With locking, an A-formula D ← B can be temporarily disabled by another A-formula C ← A if C subsumes D, even if A ⊈ B. Here we make a first discovery: AVATAR-style locking compromises completeness and must be curtailed.

Finally, the fifth layer is an *AVATAR-based prover* (Sect. 6). This refines the locking model-guided prover of the fourth layer with the given clause procedure, which saturates an A-formula set by distinguishing between active and passive A-formulas. Here we make another discovery: Selecting A-formulas fairly is not enough to guarantee completeness. We need a stronger criterion.

In a hypothetical tête-à-tête with the designers of labeled splitting, they might gently point out that by pioneering the use of a propositional model, including locking, they almost invented AVATAR themselves. Likewise, developers of SMT solvers might be tempted to claim that Voronkov merely reinvented SMT. To investigate such questions, we apply our framework to splitting without backtracking, labeled splitting, and SMT with quantifiers (Sect. 7). This gives us a solid basis for comparison as well as some new theoretical results.

A technical report [8] is available with the proofs, several counterexamples, and further details. A formalization using Isabelle/HOL [16] is underway.

## 2 Preliminaries

Our framework is parameterized by abstract notions of formulas, consequence relations, inferences, and redundancy. We largely follow the conventions of Waldmann et al. [23]. A-formulas generalize Voronkov's A-clauses [22].

Formulas. A set **F** of *formulas* is a set that contains a distinguished element ⊥ denoting falsehood. A *consequence relation* |= ⊆ (P(**F**))<sup>2</sup> has the following properties for all M, N, P, Q ⊆ **F** and C, D ∈ **F**: (D1) {⊥} |= ∅; (D2) {C} |= {C}; (D3) if M ⊆ N and P ⊆ Q, then M |= P implies N |= Q; (D4) if M |= P, N |= Q ∪ {C} for every C ∈ M, and N ∪ {D} |= Q for every D ∈ P, then N |= Q. The intended meaning of M |= N is ⋀M −→ ⋁N. From |=, we can easily derive a relation |=<sup>∩</sup> understood as ⋀M −→ ⋀N, as required by the saturation framework.

The |= notation can be extended to allow negation on either side. Let **F**<sup>∼</sup> be defined as **F** ⊎ {∼C | C ∈ **F**} such that ∼∼C = C. Given M, N ⊆ **F**<sup>∼</sup>, we have M |= N if and only if {C ∈ **F** | C ∈ M} ∪ {C ∈ **F** | ∼C ∈ N} |= {C ∈ **F** | ∼C ∈ M} ∪ {C ∈ **F** | C ∈ N}.

Following the saturation framework [23, p. 318], we distinguish between the consequence relation |= used for stating refutational completeness and a possibly stronger consequence relation |≈ for soundness. We require that |≈ is compact.

Example 2. In clausal first-order logic with equality, the formulas in **F** consist of clauses over a signature Σ. Each clause C is a finite multiset of literals L<sub>1</sub>, ..., L<sub>n</sub>, written C = L<sub>1</sub> ∨ ··· ∨ L<sub>n</sub>. Each literal L is either an atom or its negation (¬), and each atom is an unoriented equation s ≈ t. We have M |= N if and only if every Σ-model of M also satisfies at least one clause in N.

Calculi and Derivations. A refutational calculus (*Inf*, *Red*) combines a set of inferences *Inf* and a redundancy criterion *Red*. We refer to Waldmann et al. [23] for the precise definitions. Recall in particular that *Inf*(N) is the set of inferences from N, *Inf*(N, M) = *Inf*(N ∪ M) \ *Inf*(N \ M), N is *saturated* w.r.t. *Inf* and *Red*<sub>I</sub> if *Inf*(N) ⊆ *Red*<sub>I</sub>(N), and (*Inf*, *Red*) is *statically* (*refutationally*) *complete* (w.r.t. |=) if ⊥ ∈ N for every N |= {⊥} saturated w.r.t. *Inf* and *Red*<sub>I</sub>.

Let (X<sub>i</sub>)<sub>i</sub> be a sequence of sets. Its *limit inferior* is X<sub>∞</sub> = lim inf<sub>j→∞</sub> X<sub>j</sub> = ⋃<sub>i</sub> ⋂<sub>j≥i</sub> X<sub>j</sub>, and its *limit superior* is X<sup>∞</sup> = lim sup<sub>j→∞</sub> X<sub>j</sub> = ⋂<sub>i</sub> ⋃<sub>j≥i</sub> X<sub>j</sub>. The elements of X<sub>∞</sub> are called *persistent*. A sequence (N<sub>i</sub>)<sub>i</sub> over P(**F**) is *weakly fair* w.r.t. *Inf* and *Red*<sub>I</sub> if *Inf*(N<sub>∞</sub>) ⊆ ⋃<sub>i</sub> *Red*<sub>I</sub>(N<sub>i</sub>) and *strongly fair* if (*Inf*(N<sub>i</sub>))<sup>∞</sup> ⊆ ⋃<sub>i</sub> *Red*<sub>I</sub>(N<sub>i</sub>). Given a relation ▷, a ▷-*derivation* is an infinite sequence such that x<sub>i</sub> ▷ x<sub>i+1</sub> for every i. Finite runs can be extended to derivations via stuttering.
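For eventually periodic sequences of sets, the two limits are easy to compute, which may help build intuition; the `(prefix, cycle)` encoding of an infinite sequence is our own illustrative device:

```python
def lim_inf(prefix, cycle):
    """X_inf: elements in every set from some index on.  For an eventually
    periodic sequence, the finite prefix is irrelevant."""
    return set.intersection(*cycle)

def lim_sup(prefix, cycle):
    """X^inf: elements occurring in infinitely many sets: the union over the cycle."""
    return set.union(*cycle)

# N_0 = {a, b}; afterwards the sequence alternates {a, c}, {a, d}, {a, c}, ...
prefix, cycle = [{"a", "b"}], [{"a", "c"}, {"a", "d"}]
print(sorted(lim_inf(prefix, cycle)))  # ['a']            -- the persistent elements
print(sorted(lim_sup(prefix, cycle)))  # ['a', 'c', 'd']
```

Here b is dropped after the prefix, c and d recur infinitely often without persisting, and only a is persistent.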

Let ▷<sub>RedF</sub> ⊆ (P(**F**))<sup>2</sup> be the relation such that M ▷<sub>RedF</sub> N if and only if M \ N ⊆ *Red*<sub>F</sub>(N). The calculus (*Inf*, *Red*) is *dynamically* (*refutationally*) *complete* (w.r.t. |=) if for every ▷<sub>RedF</sub>-derivation (N<sub>i</sub>)<sub>i</sub> that is weakly fair w.r.t. *Inf* and *Red*<sub>I</sub> and such that N<sub>0</sub> |= {⊥}, we have ⊥ ∈ N<sub>i</sub> for some i.

A-Formulas. We fix throughout a countable set **V** of *propositional variables* v<sub>0</sub>, v<sub>1</sub>, .... For each v ∈ **V**, let ¬v ∈ ¬**V** denote its negation, with ¬¬v = v. We assume that a formula *fml*(v) ∈ **F** is associated with each v ∈ **V**. Intuitively, v approximates *fml*(v) at the propositional level. This definition is extended so that *fml*(¬v) = ∼*fml*(v). An *assertion* a ∈ **A** = **V** ∪ ¬**V** is either a propositional variable v or its negation ¬v. Given a formula C ∈ **F**<sup>∼</sup>, let *asn*(C) denote the set of assertions a ∈ **A** such that {*fml*(a)} |≈ {C} and {C} |≈ {*fml*(a)}.

A *propositional interpretation* J ⊆ **A** is a set such that for every v ∈ **V**, exactly one of v ∈ J and ¬v ∈ J holds. We reserve the letter J for interpretations, and define *fml*(J) = {*fml*(a) <sup>|</sup> <sup>a</sup> <sup>∈</sup> <sup>J</sup>}.

An *A-formula* over a set **F** of *base formulas* and an assertion set **A** is a pair C = (C, A) ∈ **AF** = **F** × P<sub>fin</sub>(**A**), written C ← A, where C is a formula and A is a finite set of assertions {a<sub>1</sub>, ..., a<sub>n</sub>} understood as an implication a<sub>1</sub> ∧ ··· ∧ a<sub>n</sub> −→ C. We identify C ← ∅ with C and define the projection ⌊C ← A⌋ = C. Moreover, N<sub>⊥</sub> is the set consisting of all A-formulas of the form ⊥ ← A ∈ N. We call such A-formulas *propositional clauses*. Note the use of calligraphic letters (e.g., C, N) to range over A-formulas and sets of A-formulas.

We say that C ← A ∈ **AF** is *enabled* in J if A ⊆ J. A set of A-formulas is *enabled* in J if all of its members are enabled in J. The *enabled projection* N<sub>J</sub> consists of the projections ⌊C⌋ of all A-formulas C ∈ N enabled in J. Analogously, the *enabled projection Inf*<sub>J</sub> of a set *Inf* of **AF**-inferences consists of the projections ⌊ι⌋ of all inferences ι ∈ *Inf* whose premises are all enabled in J.
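In concrete terms, enabledness and the enabled projection are just a subset test and a filter. A minimal sketch (the pair encoding of A-formulas and the signed-integer assertions are our own illustrative choices):

```python
def enabled_projection(n, j):
    """N_J: the projections C of all A-formulas C <- A in N with A a subset of J."""
    return {c for (c, a) in n if a <= j}

# A-formulas as (formula, assertion set); assertions as signed integers,
# where -v stands for the negated variable.
N = {("p(x)", frozenset({1})), ("q(y,b)", frozenset({2})), ("r", frozenset())}
J = frozenset({1, -2})  # [p(x)] true, [q(y,b)] false
print(sorted(enabled_projection(N, J)))  # ['p(x)', 'r']
```

Note that an A-formula with an empty assertion set, such as `r` above, is enabled in every interpretation, matching the identification of C ← ∅ with C.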

A propositional interpretation J is a *propositional model* of N<sub>⊥</sub>, written J |= N<sub>⊥</sub>, if ⊥ ∉ (N<sub>⊥</sub>)<sub>J</sub>. Moreover, we write J |≈ N<sub>⊥</sub> if ⊥ ∉ (N<sub>⊥</sub>)<sub>J</sub> or *fml*(J) |≈ {⊥}. A set N<sub>⊥</sub> is *propositionally satisfiable* if there exists an interpretation J such that J |= N<sub>⊥</sub>. In contrast to consequence relations, propositional modelhood |= interprets the set N<sub>⊥</sub> conjunctively: J |= N<sub>⊥</sub> is understood as J |= ⋀N<sub>⊥</sub>.

Finally, we lift |= and |≈ from P(**F**) to P(**AF**): M |= N if and only if M<sub>J</sub> |= N<sub>J</sub> for every J in which N is enabled, and M |≈ N if and only if *fml*(J) ∪ M<sub>J</sub> |≈ N<sub>J</sub> for every J in which N is enabled.

Example 3. In the original AVATAR [22], the connection between first-order clauses and assertions takes the form of a function []: **F** → **A**. The encoding is such that [¬C] = ¬[C] for every ground unit clause C and [C]=[D] if and only if C is syntactically equal to D up to variable renaming. This can be supported in our framework by letting *fml*(v) = C for some C such that [C] = v, for every v.

## 3 Splitting Calculi

Let **F** be a set of base formulas equipped with ⊥, |=, and |≈. The relation |≈ is assumed to be nontrivial: (D5) ∅ ̸|≈ ∅. Let **A** be a set of assertions over **V** and **AF** be the set of A-formulas over **F** and **A**. Let (*FInf*, *FRed*) be a base calculus for **F**, where *FRed* is a redundancy criterion that additionally satisfies (1) an inference is *FRed*<sub>I</sub>-redundant if one of its premises is *FRed*<sub>F</sub>-redundant; (2) ⊥ ∉ *FRed*<sub>F</sub>(N) for every N ⊆ **F**; and (3) C ∈ *FRed*<sub>F</sub>({⊥}) for every C ≠ ⊥. These requirements can easily be met by a well-designed redundancy criterion [1, Sect. 4.3].

Below, we will define the *splitting calculus* induced by the base calculus. We will see that it not only is statically and dynamically complete w.r.t. |=, but also meets stronger, "local completeness" criteria that capture model switching.

The Inference Rules. We start with the mandatory inference rules.

Definition 4. The *splitting inference system SInf* consists of all instances of

$$\frac{(C\_i \leftarrow A\_i)\_{i=1}^n}{D \leftarrow A\_1 \cup \cdots \cup A\_n} \text{Base} \qquad \frac{(\bot \leftarrow A\_i)\_{i=1}^n}{\bot} \text{Unsat}$$

For Base, the side condition is (C<sub>n</sub>, ..., C<sub>1</sub>, D) ∈ *FInf*. For Unsat, the side condition is that {⊥ ← A<sub>1</sub>, ..., ⊥ ← A<sub>n</sub>} is propositionally unsatisfiable.

In addition, the following optional inference rules can be used:

$$\frac{C \leftarrow A}{\bot \leftarrow \{\neg a\_1, \dots, \neg a\_n\} \cup A \quad (C\_i \leftarrow \{a\_i\})\_{i=1}^n} \text{Split}$$

$$\begin{array}{c} \frac{(\bot \leftarrow A\_i)\_{i=1}^n \quad C \leftarrow A}{(\bot \leftarrow A\_i)\_{i=1}^n} \text{Collect} \qquad \frac{(\bot \leftarrow A\_i)\_{i=1}^n \quad C \leftarrow A \cup B}{(\bot \leftarrow A\_i)\_{i=1}^n \quad C \leftarrow B} \text{Trim} \\\\ \frac{(\bot \leftarrow A\_i)\_{i=1}^n}{\bot} \text{StrongUnsat} \quad \frac{C \leftarrow A}{\bot \leftarrow \{\neg a\} \cup A} \text{Approx} \quad \frac{}{C \leftarrow A} \text{Tauto} \end{array}$$

The following side conditions apply. For Split: C ≠ ⊥ is splittable into C<sub>1</sub>, ..., C<sub>n</sub> and a<sub>i</sub> ∈ *asn*(C<sub>i</sub>) for each i. A formula C is *splittable* into two or more formulas C<sub>1</sub>, ..., C<sub>n</sub> if {C} |≈ {C<sub>1</sub>, ..., C<sub>n</sub>} and C ∈ *FRed*<sub>F</sub>({C<sub>i</sub>}) for each i. For Collect: C ≠ ⊥ and {⊥ ← A<sub>i</sub>}<sup>n</sup><sub>i=1</sub> |≈ {⊥ ← A}. For Trim: C ≠ ⊥ and {⊥ ← A<sub>i</sub>}<sup>n</sup><sub>i=1</sub> ∪ {⊥ ← A} |≈ {⊥ ← B}. For StrongUnsat: {⊥ ← A<sub>i</sub>}<sup>n</sup><sub>i=1</sub> |≈ {⊥}. For Approx: a ∈ *asn*(C). For Tauto: |≈ {C ← A}.

The three rules identified by double bars are simplifications; they replace their premises with their conclusions in the current A-formula set. The premises' removal is justified by *SRed* <sup>F</sup>, defined below. Also note that Base preserves the soundness of *FInf* w.r.t. |≈ and that the other rules are sound w.r.t. |≈.

The Split rule performs an n-way case split on C. Each case C<sub>i</sub> is approximated by an assertion a<sub>i</sub>. The first conclusion expresses that the case distinction is exhaustive. The n other conclusions assume C<sub>i</sub> if its approximation a<sub>i</sub> is true. In a clausal prover, typically C = C<sub>1</sub> ∨ ··· ∨ C<sub>n</sub>, where the subclauses C<sub>i</sub> have mutually disjoint sets of variables and form a maximal split.
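A maximal split can be computed by grouping literals into the connected components of variable sharing. A sketch under our own clause encoding (literals paired with their variable sets; nothing here is the paper's notation):

```python
def maximal_split(clause):
    """Partition a clause into maximal variable-disjoint subclauses, as in
    Split's typical use.  A literal is a pair (text, frozenset of variables)."""
    parts = []  # each part: (list of literal texts, set of shared variables)
    for lit, vs in clause:
        merged = ([lit], set(vs))
        # merge every existing part that shares a variable with this literal
        for part in [p for p in parts if p[1] & vs]:
            merged[0].extend(part[0])
            merged[1].update(part[1])
            parts.remove(part)
        parts.append(merged)
    return [sorted(lits) for lits, _ in parts]

# p(x) ∨ q(x, y) ∨ r(z) ∨ s(a): p(x) and q(x, y) share x; r(z) and the
# ground literal s(a) each stand alone.
clause = [("p(x)", frozenset({"x"})), ("q(x,y)", frozenset({"x", "y"})),
          ("r(z)", frozenset({"z"})), ("s(a)", frozenset())]
print(maximal_split(clause))  # [['p(x)', 'q(x,y)'], ['r(z)'], ['s(a)']]
```

Ground literals share no variables, so each forms its own component, yielding the finest variable-disjoint partition.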

Collect and Trim do some garbage collection. StrongUnsat is a variant of Unsat that uses |≈ instead of <sup>|</sup>=. It might correspond to invoking an SMT solver [3] (|≈) with a time limit, falling back on a SAT solver (|=). Approx can be used to make any derived A-formula visible to |≈. Tauto allows communication in the other direction, from the SAT solver to the calculus.

Example 5. Suppose the base calculus is first-order resolution [2] and the initial clauses are ¬p(a), ¬q(z, z), and p(x) ∨ q(y, b), as in Example 1. Split replaces the last clause by ⊥ ← {¬v<sub>0</sub>, ¬v<sub>1</sub>}, p(x) ← {v<sub>0</sub>}, and q(y, b) ← {v<sub>1</sub>}. Two Base inferences then generate ⊥ ← {v<sub>0</sub>} and ⊥ ← {v<sub>1</sub>}. Finally, Unsat generates ⊥.

The Redundancy Criterion. Next, we lift the base redundancy criterion.

Definition 6. The *splitting redundancy criterion SRed* = (*SRed*<sub>I</sub>, *SRed*<sub>F</sub>) is specified as follows. An A-formula C ← A ∈ **AF** is redundant w.r.t. N, written C ← A ∈ *SRed*<sub>F</sub>(N), if (1) C ∈ *FRed*<sub>F</sub>(N<sub>J</sub>) for every propositional interpretation J ⊇ A or (2) there exists an A-formula C ← B ∈ N with B ⊂ A. An inference ι ∈ *SInf* is redundant w.r.t. N, written ι ∈ *SRed*<sub>I</sub>(N), if (1) ι is a Base inference and {ι}<sub>J</sub> ⊆ *FRed*<sub>I</sub>(N<sub>J</sub>) for every J or (2) ι is an Unsat inference and ⊥ ∈ N.

*SRed* qualifies as a redundancy criterion. It can justify the deletion of A-formulas that are propositionally tautological. It also allows other simplifications, as long as the assertions on A-formulas used to simplify a given C ← A are contained in A. If the base criterion *FRed*<sub>F</sub> supports subsumption, this also extends to A-formulas: D ← B ∈ *SRed*<sub>F</sub>({C ← A}) if D is strictly subsumed by C and B ⊇ A, or if C = D and B ⊃ A.

Local Saturation. It is not difficult to show that if (*FInf*, *FRed*) is statically complete, then (*SInf*, *SRed*) is statically and hence dynamically complete. However, this result fails to capture a key aspect of most splitting architectures. Since ▷<sub>SRedF</sub>-derivations have no notion of a current split branch or model J, they must also perform disabled inferences. To respect enabledness, we need a weaker notion of saturation. If an A-formula set is consistent, it should suffice to saturate w.r.t. a single propositional model. In other words, if no A-formula ⊥ ← A with A ⊆ J is derivable for some model J |= N<sub>⊥</sub>, the prover should be allowed to give a verdict of "consistent." We will call such model-specific saturations *local*.

Definition 7. A set N ⊆ **AF** is *locally saturated* w.r.t. *SInf* and *SRed*<sub>I</sub> if either ⊥ ∈ N or there exists J |= N<sub>⊥</sub> such that N<sub>J</sub> is saturated w.r.t. *FInf* and *FRed*<sub>I</sub>.

Theorem 8 (Strong static completeness). *Assume* (*FInf*, *FRed*) *is statically complete. Given a set* N ⊆ **AF** *that is locally saturated w.r.t. SInf and SRed*<sub>I</sub> *and such that* N |= {⊥}, *we have* ⊥ ∈ N.

Example 9. Consider the A-clause set {⊥ ← {¬[p(x)], ¬[q(y)]}, p(x) ← {[p(x)]}, q(y) ← {[q(y)]}, ¬q(a)} expressed using AVATAR conventions. It is not saturated for resolution, because the conclusion ⊥ ← {[q(y)]} of resolving the last two A-clauses is missing, but it is locally saturated with J ⊇ {[p(x)], ¬[q(y)]}.

Definition 10. A sequence (N<sub>i</sub>)<sub>i</sub> of sets of A-formulas is *locally fair* w.r.t. *SInf* and *SRed*<sub>I</sub> if either ⊥ ∈ N<sub>i</sub> for some i or there exists J |= (N<sub>∞</sub>)<sub>⊥</sub> such that *FInf*((N<sub>∞</sub>)<sub>J</sub>) ⊆ ⋃<sub>i</sub> *FRed*<sub>I</sub>((N<sub>i</sub>)<sub>J</sub>).

Theorem 11 (Strong dynamic completeness). *Assume* (*FInf*, *FRed*) *is statically complete. Given a* ▷<sub>SRedF</sub>-*derivation* (N<sub>i</sub>)<sub>i</sub> *that is locally fair w.r.t. SInf and SRed*<sub>I</sub> *and such that* N<sub>0</sub> |= {⊥}, *we have* ⊥ ∈ N<sub>i</sub> *for some* i.

In Sects. 4 to 6, we will review three transition systems of increasing complexity, culminating with an idealized specification of AVATAR. They will be linked by a chain of stepwise refinements, like pearls on a string. All derivations using these will correspond to ▷<sub>SRedF</sub>-derivations, and their fairness criteria will imply local fairness. Consequently, by Theorem 11, they will all be complete.

## 4 Model-Guided Provers

AVATAR and other splitting architectures maintain a model of the propositional clauses, which represents the split tree's current branch. We can capture this abstractly by refining ▷<sub>SRedF</sub>-derivations to incorporate a propositional model.

The states are now pairs (J, N ), where J is a propositional model and N ⊆ **AF**. Initial states have the form (J, N), where N ⊆ **F**. The *model-guided prover* MG is defined by the following transition rules:


From an =⇒<sub>MG</sub>-derivation, we obtain a ▷<sub>SRedF</sub>-derivation by simply erasing the J components. The Derive rule can add new A-formulas and delete redundant A-formulas. J should be a model of N<sub>⊥</sub> most of the time; when it is not, Switch can be used to switch the model, or StrongUnsat to finish the refutation.

Example 12. Let us revisit Example 5. Initially, let J<sub>0</sub> = {¬v<sub>0</sub>, ¬v<sub>1</sub>}. After the split, we have ¬p(a), ¬q(z, z), p(x) ← {v<sub>0</sub>}, q(y, b) ← {v<sub>1</sub>}, and ⊥ ← {¬v<sub>0</sub>, ¬v<sub>1</sub>}. The natural option is to switch the model. We take J<sub>1</sub> = {v<sub>0</sub>, ¬v<sub>1</sub>}. We then derive ⊥ ← {v<sub>0</sub>}. Since J<sub>1</sub> ̸|= ⊥ ← {v<sub>0</sub>}, we switch to J<sub>2</sub> = {¬v<sub>0</sub>, v<sub>1</sub>}, where we derive ⊥ ← {v<sub>1</sub>}. Finally, we detect that the propositional clauses are unsatisfiable.

We need a fairness criterion for MG that implies local fairness of the underlying ▷<sub>SRedF</sub>-derivation. The latter requires a witness J but gives us no hint as to where to look for one. Our solution involves a topological concept: J is a *limit point* of (J<sub>i</sub>)<sub>i</sub> if there exists a subsequence (J′<sub>i</sub>)<sub>i</sub> of (J<sub>i</sub>)<sub>i</sub> such that J = J′<sup>∞</sup> = J′<sub>∞</sub>.

Example 13. Let (J<sub>i</sub>)<sub>i</sub> be the sequence such that J<sub>2i</sub> ∩ **V** = {v<sub>1</sub>, v<sub>3</sub>, ..., v<sub>2i−1</sub>} (i.e., v<sub>1</sub>, v<sub>3</sub>, ..., v<sub>2i−1</sub> are true and the other variables are false) and J<sub>2i+1</sub> = (J<sub>2i</sub> \ {¬v<sub>2i</sub>}) ∪ {v<sub>2i</sub>}. Although it is not in the sequence, the interpretation J ∩ **V** = {v<sub>1</sub>, v<sub>3</sub>, ...} is a limit point. The associated split tree is shown in Fig. 1. The direct path from the root to a node J<sub>i</sub> specifies the assertions that are true in J<sub>i</sub>.
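Restricted to any finite window of variables, the even-indexed subsequence of Example 13 eventually becomes constant and thus converges to the limit point. A small sketch (the integer encoding of variables as indices is our own):

```python
def J(m, n):
    """Example 13's interpretation J_m, restricted to the window v_0 .. v_{n-1};
    returned as the set of indices of the true variables."""
    i = m // 2
    true = set(range(1, 2 * i, 2))  # J_{2i} makes v_1, v_3, ..., v_{2i-1} true
    if m % 2 == 1:
        true.add(2 * i)             # J_{2i+1} additionally flips v_{2i} to true
    return {k for k in true if k < n}

n = 10
tail = [J(2 * i, n) for i in range(n, 2 * n)]  # a tail of the subsequence (J_{2i})_i
limit = set(range(1, n, 2))                     # v_1, v_3, ..., v_9 true: the limit point
print(all(j == limit for j in tail))            # True: the subsequence has stabilized
```

The odd-indexed interpretations each flip one extra variable, which is why only the even-indexed subsequence converges.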

Example 14. Let (J<sub>i</sub>)<sub>i</sub> be such that J<sub>0</sub> ∩ **V** = ∅, J<sub>4i+1</sub> ∩ **V** = {v<sub>0</sub>} ∪ {v<sub>4j+3</sub> | j < i}, J<sub>4i+2</sub> ∩ **V** = {v<sub>0</sub>, v<sub>4i+2</sub>} ∪ {v<sub>4j+3</sub> | j < i}, J<sub>4i+3</sub> ∩ **V** = {v<sub>4j+1</sub> | j ≤ i}, and J<sub>4i+4</sub> ∩ **V** = {v<sub>4j+1</sub> | j ≤ i} ∪ {v<sub>4i+4</sub>}. This sequence has two limit points: J = lim inf<sub>i→∞</sub> J<sub>4i+1</sub> and J′ = lim inf<sub>i→∞</sub> J<sub>4i+3</sub>. The split tree is depicted in Fig. 2.

Basic topology tells us that every sequence has a limit point. No matter how erratically the prover switches branches, it will fully explore at least one of them. It then suffices to perform the base *FInf* -inferences fairly in that branch:

Definition 15. An =⇒<sub>MG</sub>-derivation (J<sub>i</sub>, N<sub>i</sub>)<sub>i</sub> is *fair* if either (1) ⊥ ∈ N<sub>i</sub> for some i or (2) J<sub>i</sub> |= (N<sub>i</sub>)<sub>⊥</sub> for infinitely many indices i and there exists a limit point J of (J<sub>i</sub>)<sub>i</sub> such that *FInf*((N<sub>∞</sub>)<sub>J</sub>) ⊆ ⋃<sub>i</sub> *FRed*<sub>I</sub>((N<sub>i</sub>)<sub>J</sub>).

Fairness of an =⇒<sub>MG</sub>-derivation implies local fairness of the underlying ▷<sub>SRedF</sub>-derivation. A well-behaved propositional solver, as in labeled splitting, always gives rise to a single limit point J<sub>∞</sub>, which can be taken for J in Definition 15.

Fig. 1: A split tree with a single infinite branch


Fig. 2: A split tree with two infinite branches

By contrast, an unconstrained solver, as supported by AVATAR, can produce multiple limit points. Then it is more challenging to ensure fairness.

Example 16. Consider the consistent set consisting of ¬p(x), p(a) ∨ q(a), and ¬q(y) ∨ p(f(y)) ∨ q(f(y)). Splitting the second clause into p(a) and q(a) and resolving q(a) with the third clause yields p(f(a)) ∨ q(f(a)). This process can be iterated. Now suppose that v<sub>2i</sub> and v<sub>2i+1</sub> are associated with p(f<sup>i</sup>(a)) and q(f<sup>i</sup>(a)), respectively. If we split every emerging p(f<sup>i</sup>(a)) ∨ q(f<sup>i</sup>(a)) and the SAT solver always makes v<sub>2i</sub> true first, we end up with the situation of Example 13 and Fig. 1. For the limit point J, all *FInf*-inferences are performed. Thus, the derivation is fair.

Example 17. We build a clause set from two copies of Example 16, where each clause C from each copy i ∈ {1, 2} is extended to ¬r<sub>i</sub> ∨ C. We add the clause r<sub>1</sub> ∨ r<sub>2</sub> and split it as our first move. From there, each branch imitates Example 16. A SAT solver might jump back and forth, as in Example 14 and Fig. 2. Even if A-clauses get disabled and re-enabled infinitely often, we must perform all nonredundant inferences in at least one of the two limit points (J or J′).

## 5 Locking Provers

Next, we refine the model-guided prover into a locking prover that temporarily locks away A-formulas that are redundant locally w.r.t. some J but not globally. The states are triples (J, N, L), with L ⊆ P<sub>fin</sub>(**A**) × **AF**. Intuitively, (B, C ← A) ∈ L means that C ← A is "locally redundant" in interpretations J ⊇ B. The function ⌊·⌋ erases the locks: ⌊L⌋ = {C | (B, C) ∈ L for some B}. Initial states have the form (J, N, ∅), where N ⊆ **F**. The *locking prover* is defined by these two rules:


We note that =⇒L-derivations refine =⇒MG-derivations, with states (J, N, L) mapped to (J, N ∪ ⌊L⌋).

Locking can cause incompleteness, because an A-formula can be locally redundant at every point in the derivation and yet not be so at any limit point, thereby breaking local saturation. For example, if we have derived p(x) ← {¬v<sub>k</sub>} for every k, then p(c) is locally redundant in any J that contains some ¬v<sub>k</sub>. For the models J<sub>i</sub> = {v<sub>1</sub>, ..., v<sub>i</sub>, ¬v<sub>i+1</sub>, ...}, the clause p(c) would always be locally redundant and ignored. Yet p(c) might not be locally redundant at the unique limit point J = **V**. We could rule out this counterexample by requiring that derivations are strongly fair—that is, every inference possible infinitely often must eventually be made redundant. However, we have found a counterexample showing that strong fairness does not ensure completeness [8, Example 46]. It would seem that this counterexample could arise with Vampire if the underlying SAT solver produces this specific sequence of interpretations.

Our solution is as follows. Let (J<sub>i</sub>, N<sub>i</sub>, L<sub>i</sub>)<sub>i</sub> be an =⇒L-derivation, let (J′<sub>j</sub>)<sub>j</sub> be a subsequence of (J<sub>i</sub>)<sub>i</sub>, and let (N′<sub>j</sub>)<sub>j</sub> be the corresponding subsequence of (N<sub>i</sub>)<sub>i</sub>. To achieve fairness, we now consider N′<sub>∞</sub>, the A-formulas persistent in the unlocked subsequence (N′<sub>j</sub>)<sub>j</sub>. By contrast, fairness of =⇒MG-derivations used N<sub>∞</sub>.

Definition 18. An =⇒L-derivation (J<sub>i</sub>, N<sub>i</sub>, L<sub>i</sub>)<sub>i</sub> is *fair* if either (1) ⊥ ∈ ⋃<sub>i</sub> N<sub>i</sub> or (2) J<sub>i</sub> |= (N<sub>i</sub>)<sub>⊥</sub> for infinitely many indices i and there exists a subsequence (J′<sub>j</sub>)<sub>j</sub> converging to a limit point J such that *FInf*((N′<sub>∞</sub>)<sub>J</sub> ∪ ((lim sup<sub>j→∞</sub> ⌊L′<sub>j</sub>⌋)<sub>J</sub> \ (⌊L′<sub>∞</sub>⌋)<sub>J</sub>)) ⊆ ⋃<sub>i</sub> *FRed*<sub>I</sub>((N<sub>i</sub> ∪ ⌊L<sub>i</sub>⌋)<sub>J</sub>), where (N′<sub>j</sub>)<sub>j</sub> and (L′<sub>j</sub>)<sub>j</sub> correspond to (J′<sub>j</sub>)<sub>j</sub>.

Fairness of an =⇒L-derivation implies fairness of the corresponding =⇒MG-derivation. The condition on the sets L′<sub>j</sub> ensures that inferences from A-formulas that are locked infinitely often, but not infinitely often with the same lock, are redundant at the limit point. In particular, if we know that each A-formula is locked at most finitely often, then lim sup<sub>j→∞</sub> ⌊L′<sub>j</sub>⌋ = ⌊L′<sub>∞</sub>⌋ and the inclusion in the definition above simplifies to *FInf*((N′<sub>∞</sub>)<sub>J</sub>) ⊆ ⋃<sub>i</sub> *FRed*<sub>I</sub>((N<sub>i</sub> ∪ ⌊L<sub>i</sub>⌋)<sub>J</sub>).

## 6 AVATAR-Based Provers

AVATAR was unveiled in 2014 by Voronkov [22]. Since then, he and his colleagues have studied many options and extensions [3, 17]. A second implementation, in Lean's super tactic, is due to Ebner [9]. Here we attempt to capture AVATAR's essence.

The abstract AVATAR-based prover we define in this section extends the locking prover L with a given clause procedure [13]. A-formulas are moved in turn from the passive to the active set, where inferences are performed. The heuristic for choosing the next *given* A-formula to move is guided by timestamps indicating when the A-formulas were derived, to ensure fairness.
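As a rough illustration (a sketch, not Vampire's actual implementation), a given clause loop driven purely by timestamps can be written as follows; `infer` stands for an arbitrary caller-supplied inference function:

```python
import heapq

# A minimal given clause loop in which the passive set is keyed by derivation
# timestamps, so the oldest A-formula is always chosen as the given formula.
# `infer` maps (given, active) to the newly derived formulas.
def given_clause_loop(initial, infer, max_steps=100):
    clock = 0
    passive = []                                   # heap of (timestamp, formula)
    for formula in initial:
        heapq.heappush(passive, (clock, formula))
        clock += 1
    active = []
    for _ in range(max_steps):
        if not passive:
            break
        _, given = heapq.heappop(passive)          # oldest passive formula
        active.append(given)                       # move it to the active set
        for new in infer(given, active):           # perform inferences
            heapq.heappush(passive, (clock, new))  # timestamp the conclusions
            clock += 1
    return active
```

Selecting by minimal timestamp is exactly the age-based heuristic mentioned above; real provers interleave it with other selection criteria, as discussed below.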

Let **TAF** = **AF** × ℕ be the set of *timestamped A-formulas*. Given N ⊆ **TAF**, we define ⌊N⌋ = {C | (C, t) ∈ N for some t}, and we overload existing notations to erase timestamps, so that, e.g., N<sub>⊥</sub> stands for ⌊N⌋<sub>⊥</sub>. Note that we use a new set of calligraphic letters (e.g., C, N) to range over timestamped A-formulas and sets thereof. Using the saturation framework [23, Sect. 3], we lift (*SInf*, *SRed*) to a calculus (*TSInf*, *TSRed*) on **TAF** with the tiebreaker order > on timestamps, so that (C, t + k) ∈ *TSRed*<sub>F</sub>({(C, t)}) for any k > 0.

A state is a tuple (J, A, P, Q, L) ∈ P(**A**) × P(**TAF**)<sup>3</sup> × P(Pfin(**A**) × **TAF**), where A, P, and Q are respectively the sets of *active*, *passive*, and other (disabled or propositional) timestamped A-formulas, and L is the set of locked timestamped A-formulas, such that (1) A<sub>⊥</sub> = P<sub>⊥</sub> = ∅, (2) A ∪ P is enabled in J, and (3) Q<sub>J</sub> ⊆ {⊥}. The *AVATAR-based prover* AV is defined as follows:


There is also a LockP rule that is identical to LockA except that it starts in the state (J, A, P ⊎ {(C ← A, t)}, Q, L). An AV-derivation is *well timestamped* if every A-formula introduced by a rule is assigned a unique timestamp.

Let (J<sub>i</sub>, A<sub>i</sub>, P<sub>i</sub>, Q<sub>i</sub>, L<sub>i</sub>)<sub>i</sub> be an =⇒AV-derivation. It is easy to see that it refines the =⇒L-derivation (J<sub>i</sub>, A<sub>i</sub> ∪ P<sub>i</sub> ∪ Q<sub>i</sub>, L<sub>i</sub>)<sub>i</sub> and that the saturation invariant *TSInf*(A<sub>i</sub>) ⊆ *TSRed*<sub>I</sub>(A<sub>i</sub> ∪ P<sub>i</sub> ∪ Q<sub>i</sub> ∪ ⌊L<sub>i</sub>⌋) holds if A<sub>0</sub> = ∅.

In contrast with nonsplitting provers, for AV, fairness w.r.t. formulas does not imply fairness w.r.t. inferences. A problematic scenario involves two premises C, D of an inference ι and four transitions repeated forever, possibly with other steps interleaved: Infer makes C active; Switch disables it; Infer makes D active; Switch disables it. Even though C and D are selected in a strongly fair fashion, ι is never performed. We need an even stronger fairness criterion.

Definition 19. An =⇒AV-derivation (J<sub>i</sub>, A<sub>i</sub>, P<sub>i</sub>, Q<sub>i</sub>, L<sub>i</sub>)<sub>i</sub> is *fair* if (1) ⊥ ∈ ⋃<sub>i</sub> Q<sub>i</sub> or (2) J<sub>i</sub> |= (Q<sub>i</sub>)<sub>⊥</sub> for infinitely many indices i and there exists a subsequence (J′<sub>j</sub>)<sub>j</sub> converging to a limit point J such that (3) lim inf<sub>j→∞</sub> *TSInf*(A′<sub>j</sub>, P′<sub>j</sub>) = ∅ and (4) (lim sup<sub>j→∞</sub> ⌊L′<sub>j</sub>⌋)<sub>J</sub> \ (⌊L′<sub>∞</sub>⌋)<sub>J</sub> ⊆ ⋃<sub>i</sub> *FRed*<sub>F</sub>((A<sub>i</sub> ∪ P<sub>i</sub> ∪ Q<sub>i</sub> ∪ ⌊L<sub>i</sub>⌋)<sub>J</sub>).

Condition (3) ensures that all inferences involving passive A-formulas are redundant at the limit point. It would not suffice to require P′<sub>∞</sub> = ∅, because A-formulas can move back and forth between A, P, and Q, as we just saw. Condition (4) is similar to the condition on locks in Definition 18. If the =⇒AV-derivation is fair, the corresponding =⇒L-derivation is also fair.

Many selection strategies are combinations of basic strategies, such as choosing the smallest formula by weight or the oldest by age. We capture such strategies using selection orders ⊏. Intuitively, C ⊏ D if the prover will always select C before D if both are present. We use two selection orders: ⊏<sub>**TAF**</sub>, based on timestamps, must be followed infinitely often; ⊏<sub>**F**</sub> must be followed otherwise. For the first one, we can use ⊏<sub>age</sub> defined so that (C, t) ⊏<sub>age</sub> (C′, t′) if t < t′.

Definition 20. Let X be a set. A *selection order* ⊏ on X is an irreflexive and transitive relation such that {y | y ⊏ x} is finite for all x ∈ X.

The intersection of two selection orders ⊏<sub>1</sub> and ⊏<sub>2</sub> corresponds to the nondeterministic alternation between them: the prover may choose either a ⊏<sub>1</sub>-minimal or a ⊏<sub>2</sub>-minimal A-formula, at its discretion.
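A common concrete instance is an age–weight alternation; the names and the ratio below are our own assumptions for illustration, not taken from the paper:

```python
# Sketch of alternating two selection orders: by age (timestamp) and by weight
# (formula size). Following the age order infinitely often — here, on every
# `age_ratio`-th step — is what fairness requires; the weight order is a
# heuristic that may be followed on the remaining steps.
def select(passive, step, age_ratio=2):
    """passive: list of (formula, timestamp) pairs."""
    if step % age_ratio == 0:
        return min(passive, key=lambda cf: cf[1])       # minimal timestamp
    return min(passive, key=lambda cf: len(cf[0]))      # minimal weight
```

For instance, with `passive = [("p(f(f(a)))", 0), ("q(a)", 5)]`, step 0 selects the old heavy formula and step 1 the young light one.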

To ensure completeness, we must restrict the inferences that the prover may perform; otherwise, it could derive infinitely many A-formulas with different assertions, causing it to switch between two branches of the split tree without making progress. Given N ⊆ **AF**, let N̄ = {A | C ← A ∈ N for some C} be the set of assertions occurring in N.

Definition 21. A function F : P(**AF**) → P(**AF**) is *strongly finitary* if ⌊F(N)⌋ and the set of assertions occurring in F(N) but not in N are finite for any N ⊆ **AF** such that N̄ is finite.

Intuitively, a strongly finitary function F returns finitely many base formulas and finitely many new assertions, although it may return infinitely many A-formulas. Clearly, F(N) is finite for any finite N ⊆ **AF**. If *FInf*(N) is finite for any finite N ⊆ **F**, then performing *SInf*-inferences is strongly finitary. Deterministic Split rules, such as AVATAR's, are also strongly finitary. We can lift a strongly finitary F to any N ⊆ **TAF** by taking F<sub>**TAF**</sub>(N) = F(⌊N⌋) × ℕ. If F and G are strongly finitary, then so is N ↦ F(N) ∪ G(N).

Simplification rules used by the prover must be restricted even more to ensure completeness, because they can lead to new splits and assertions. For example, simplifying p(x ∗ 0) ∨ p(x) to p(0) ∨ p(x) transforms an unsplittable clause into a splittable one. If simplifications were to produce infinitely many such clauses, the prover might split and switch models forever without making progress.

Definition 22. Let ≺ be a well-founded relation on **F**, and let ⪯ be its reflexive closure. A function S : **AF** → P(**AF**) is a *strongly finitary simplification bound* for ≺ if N ↦ ⋃<sub>C←A∈N</sub> S(C ← A) is strongly finitary and C′ ⪯ C for every C′ ← A′ ∈ S(C ← A).

The prover may simplify an A-formula C to C′ only if C′ ∈ S(C). It may also delete C. Strongly finitary simplification bounds are closed under unions, allowing the combination of simplification techniques based on the same ≺. For superposition, a natural choice for ≺ is the clause order. The key property of strongly finitary simplification bounds is that if we saturate a finite set of A-formulas w.r.t. simplifications, the saturation is also finite.

Example 23. Let **F** be the set of first-order clauses and S(C ← A) = {C′ ← A′ | C′ is a subclause of C and A′ ⊆ A}. Then S is a strongly finitary simplification bound. This S covers many simplification techniques, including elimination of duplicate literals, deletion of resolved literals, and subsumption resolution.
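For clauses modeled as sets of literal strings, the membership test for the simplification bound of Example 23 can be sketched as follows (the representation and names are ours):

```python
# Sketch of Example 23's simplification bound: a clause is a frozenset of
# literal strings and an A-formula a (clause, assertions) pair. Then
# C′ ← A′ ∈ S(C ← A) iff C′ is a subclause of C and A′ ⊆ A.
def in_simplification_bound(simplified, original):
    (c_new, a_new), (c_old, a_old) = simplified, original
    return c_new <= c_old and a_new <= a_old   # subclause and assertion subset
```

For instance, dropping a literal and an assertion stays within the bound, whereas introducing a fresh literal does not.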

Example 24. If the Knuth–Bendix order [12] is used and all weights are positive, then S(C ← A) = {C′ ← A′ | C′ ≺ C and A′ ⊆ A} is a strongly finitary simplification bound. It can be used to cover demodulation.

Equipped with the above definitions, we introduce a fairness criterion that is more concrete and easier to apply than fairness of =⇒AV-derivations. We could refine AV further and use this criterion to show the completeness of an imperative procedure such as Voronkov's extended Otter loop [22, Fig. 3], thus showing that Vampire with AVATAR is complete if locking is sufficiently restricted.

Lemma 25. *Let* I *be a strongly finitary function, and let* S *be a strongly finitary simplification bound. Then a well-timestamped* =⇒AV*-derivation* (J<sub>i</sub>, A<sub>i</sub>, P<sub>i</sub>, Q<sub>i</sub>, L<sub>i</sub>)<sub>i</sub> *is fair if all of the following conditions hold:*


#### 7 Application to Other Architectures

AVATAR may be the most natural application of our framework, but it is not the only one. Below we complete the picture by studying splitting without backtracking, labeled splitting, and SMT with quantifiers.

Splitting without Backtracking. Before AVATAR, Riazanov and Voronkov [20] had already experimented with splitting in Vampire, in a lighter variant without backtracking. They based their work on ordered resolution O with selection [2]. Weidenbach [24, end of Sect. 4.5] independently outlined the same technique. The basic idea is to extend the signature Σ with a countable set P of nullary predicate symbols and to augment the base calculus with a binary splitting rule that replaces a Σ-clause C ∨ D with two Σ<sup>P</sup>-clauses C ∨ p and D ∨ ¬p. Riazanov and Voronkov require that the precedence ≺ makes all P-literals smaller than the Σ-literals. Binary splitting is then a simplification. They also extend the selection function of the base calculus to support P-literals. Their *parallel* selection function imitates as much as possible the original selection function.

The calculus O<sup>P</sup> is closely related to an instance of our framework. Let **F** be the set of Σ-clauses, with the empty clause as ⊥. Let O = (*FInf*, *FRed*) be the base calculus. We take **V** = P. Let LA = (*SInf*, *SRed*), whose name stands for *lightweight AVATAR*, be the induced splitting calculus. Lightweight AVATAR amounts to the splitting architecture Cruanes implemented in Zipperposition [7, Sect. 2.5]. Binary splitting can be realized in LA as a Split-like simplification

rule. The calculi O<sup>P</sup> and LA disagree slightly, because O<sup>P</sup>'s order ≺ can break ties using P-literals and because LA can detect unsatisfiability early using the Unsat rule. Despite its slightly weaker order, LA is tighter than O<sup>P</sup> in the sense that saturation w.r.t. O<sup>P</sup> implies saturation w.r.t. LA but not vice versa.

Labeled Splitting. Labeled splitting, as originally described by Fietzke and Weidenbach [10] and implemented in SPASS, is a first-order resolution-based calculus with binary splitting that traverses the split tree in a depth-first way, using an elaborate backtracking mechanism inspired by CDCL [15]. It works on states (Ψ, N ), where Ψ is a stack storing the current state of the split tree and N is a set of *labeled clauses*—clauses annotated with finite sets of natural numbers.

We model labeled splitting as an instance of the locking prover L based on the splitting calculus LS = (*SInf*, *SRed*) induced by the resolution calculus R = (*FInf*, *FRed*), where |= and |≈ are as in Example 2 and **V** = ⋃<sub>i∈ℕ</sub> {l<sub>i</sub>, r<sub>i</sub>, s<sub>i</sub>}. A-clauses correspond to labeled clauses. Splits are identified by unique *split levels*. Given a split on C ∨ D with level k, l<sub>k</sub> ∈ *asn*(C) and r<sub>k</sub> ∈ *asn*(D) represent the left and right branches. In practice, the prover would dynamically extend *fml* to ensure that *fml*(l<sub>k</sub>) = C and *fml*(r<sub>k</sub>) = D.

When splitting, if we simply added ⊥ ← {¬l<sub>k</sub>, ¬r<sub>k</sub>}, we would always need to consider either C ← {l<sub>k</sub>} or D ← {r<sub>k</sub>}, depending on the interpretation. However, labeled splitting can undo splits when backtracking, and fairness would then require us to perform inferences with either C or D even when labeled splitting would not. We solve this as follows. Let ⊤ = ∼⊥. We introduce the variable s<sub>k</sub> ∈ *asn*(⊤) so that we can enable or disable the split. The StrongUnsat rule then knows that s<sub>k</sub> is true, but we can still switch to propositional models that disable both C and D. A-clauses are then split using the following binary variant of Split:

$$\frac{C \lor D \gets A}{\bot \gets \{\neg \mathbb{I}\_k, \neg \mathsf{r}\_k, \mathsf{s}\_k\} \quad C \gets A \cup \{\mathsf{l}\_k\} \quad D \gets A \cup \{\mathsf{r}\_k\}} \text{ SOFTSPLIT}$$

where C and D share no variables and k is the next split level. Unlike AVATAR, labeled splitting keeps the premise and might split it again with another level.
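Treating assertions as plain strings, one SoftSplit step can be sketched as follows (the string encoding of assertion literals is an assumption for illustration, not the prover's data structures):

```python
# Sketch of one SoftSplit application: from C ∨ D ← A with split level k,
# derive the propositional clause ⊥ ← {¬l_k, ¬r_k, s_k} together with the
# branch A-clauses C ← A ∪ {l_k} and D ← A ∪ {r_k}.
def soft_split(c, d, assertions, k):
    prop = ("⊥", {f"¬l{k}", f"¬r{k}", f"s{k}"})   # forces a branch when s_k holds
    left = (c, set(assertions) | {f"l{k}"})        # left branch
    right = (d, set(assertions) | {f"r{k}"})       # right branch
    return prop, left, right
```

Disabling the split simply amounts to making s<sub>k</sub> false in the propositional model, which satisfies the first conclusion without committing to either branch.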

To emulate the original, the locking prover based on LS must repeatedly apply the following three steps in any order until saturation:


Switch is powerful enough to support all of Fietzke and Weidenbach's backtracking rules, but to explore the tree in the same order as they do, we must choose the new model carefully. If a left branch is closed, the model must be updated so as to disable the splits that were not used to close this branch and to enable the right branch. If a right branch is closed, the split must be disabled,

and the model must switch to the right branch of the closest enabled split above it with an enabled left branch. If a right branch is closed but there is no split above with an enabled left branch, the entire tree has been visited. Then a propositional clause ⊥ ← A with A ⊆ ⋃<sub>i</sub> {s<sub>i</sub>} is |=-entailed by the A-clause set, and StrongUnsat can finish the refutation by exploiting *fml*(s<sub>i</sub>) = ⊤.

The above strategy helps achieve fairness, because it ensures that there exists exactly one limit point. It also uses locks in a well-behaved way. This means we can considerably simplify the notion of fairness for =⇒L-derivations and obtain a criterion that is almost identical to, but slightly more liberal than, Fietzke and Weidenbach's—thereby re-proving the completeness of labeled splitting.

For terminating derivations, their fairness criterion coincides with ours. For diverging derivations, Fietzke and Weidenbach construct a limit subsequence (Ψ′<sub>i</sub>, N′<sub>i</sub>)<sub>i</sub> of the derivation (Ψ<sub>i</sub>, N<sub>i</sub>)<sub>i</sub> and require that every persistent inference in it be made redundant, exactly as we do for =⇒L-derivations. The subsequence consists of all states that lie on the split tree's unique infinite branch. Locks are well behaved, with lim sup<sub>j→∞</sub> ⌊L′<sub>j</sub>⌋ = ⌊L′<sub>∞</sub>⌋, because with the strategy above, once an A-clause is enabled on the rightmost branch, it remains enabled forever. Our definition of fairness allows more subsequences, although this is difficult to exploit without bringing in all the theoretical complexity of AVATAR.

SMT with Quantifiers. Satisfiability modulo theories (SMT) solvers based on DPLL(T) [15] combine a SAT solver with theory solvers. In the classical setup, the theories are decidable, and the SMT solver is a decision procedure for the union of the theories. Some SMT solvers also support quantified formulas via instantiation at the expense of decidability.

Complete instantiation strategies have been developed for various fragments of first-order logic [11, 18, 19]. In particular, enumerative quantifier instantiation [18] is complete under some conditions. An SMT solver following such a strategy ought to be refutationally complete, but this has never been proved. Although SMT is quite different from the architectures considered above, we can instantiate our framework to show the completeness of an abstract SMT solver. The model-guided prover MG will provide a suitable starting point.

Let **F** be the set of first-order Σ-formulas. We represent the SMT solver's underlying SAT solver by the Unsat rule and complement it with an inference system *FInf* that includes rules for clausification outside quantifiers, theory reasoning, and instantiation. The clausification rules derive C and D from a premise C ∧ D, among others; the theory rules derive ⊥ from some Σ-formula set N such that N |= {⊥}, ignoring quantifiers; and the instantiation rules derive ϕ(u) from a premise ∀x. ϕ(x), where u is a ground term. For *FRed*, we take an arbitrary instance of standard redundancy; its only purpose is to let us split disjunctions destructively. We define the "theories with quantifiers" calculus TQ = (*FInf*, *FRed*). For |= and |≈, we use entailment in the supported theories, including quantifiers.

We use the same approximation function as in AVATAR (Example 3). Let us call C ← A a *subunit* if C is not a disjunction. Whenever a (ground) disjunction C ∨ D ← A emerges, we immediately apply Split. This delegates clausal reasoning to the SAT solver. It then suffices to assume that TQ is complete for subunits.

Theorem 26 (Dynamic completeness). *Assume* TQ *is statically complete for subunit sets. Let* (J<sub>i</sub>, N<sub>i</sub>)<sub>i</sub> *be a fair* =⇒MG*-derivation based on* TQ*. If* N<sub>0</sub> |= {⊥} *and* N<sub>∞</sub> *contains only subunits, then* ⊥ ∈ N<sub>j</sub> *for some* j.

Like AVATAR-based provers, SMT solvers will typically not perform all *SInf*-inferences, not even up to *SRed*<sub>I</sub>. Given a ≈ b ← {v<sub>0</sub>}, b ≈ c ← {v<sub>1</sub>}, a ≈ d ← {v<sub>2</sub>}, c ≈ d ← {v<sub>3</sub>}, and a ≉ c ← {v<sub>4</sub>}, an SMT solver will find only one of the conflicts ⊥ ← {v<sub>0</sub>, v<sub>1</sub>, v<sub>4</sub>} or ⊥ ← {v<sub>2</sub>, v<sub>3</sub>, v<sub>4</sub>}, but not both. For decidable theories, a practical fair strategy is to instantiate quantifiers only if no other rules are applicable.
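The behavior just described can be illustrated with a union-find sketch (ours, not taken from any actual SMT solver): the solver merges equivalence classes for the asserted equalities and stops at the first violated disequality instead of enumerating every conflict. Extracting the minimal conflicting assertion set would additionally require tracking explanations for each merge.

```python
# Union-find with path halving; equalities are merged eagerly, and conflict
# detection stops at the first inconsistent disequality.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def first_conflict(equalities, disequalities):
    """Each item is (lhs, rhs, label); returns the label of the first conflict."""
    uf = UnionFind()
    for lhs, rhs, _label in equalities:
        uf.union(lhs, rhs)
    for lhs, rhs, label in disequalities:
        if uf.find(lhs) == uf.find(rhs):
            return label        # first conflict found; others are not explored
    return None
```

On the example above, merging a ≈ b, b ≈ c, a ≈ d, and c ≈ d puts a and c in one class, so the disequality labeled v<sub>4</sub> is reported once, regardless of how many distinct explanations exist.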

Our mathematization of AVATAR and SMT with quantifiers exposes their dissimilarities. With SMT, splitting is mandatory, and there is no subsumption or simplification, locking, or active and passive sets. And of course, theory inferences are n-ary and quantifier instantiation is unary, whereas superposition is binary. Nevertheless, their completeness follows from the same principles.

# 8 Conclusion

Our framework captures splitting calculi and provers in a general way, independently of the base calculus. Users can conveniently derive a dynamic refutational completeness result for a splitting prover based on a given statically refutationally complete calculus. As we developed the framework, we faced some tension between constraining the SAT solver's behavior and the saturation prover's. It seemed preferable to constrain the prover, because the prover is typically easier to modify than an off-the-shelf SAT solver. To our surprise, we discovered counterexamples related to locking, formula selection, and simplification, which may affect Vampire's AVATAR implementation, depending on the SAT solver used. We proposed some restrictions, but alternatives could be investigated.

We found that labeled splitting can be seen as a variant of AVATAR where the SAT solver follows a strict strategy and propositional variables are not reused across branches. A benefit of the strict strategy is that locking preserves completeness. As for the relationship between AVATAR and SMT, there are some glaring differences, including that splitting is necessary to support disjunctions in SMT but fully optional in AVATAR. For future work, we could try to complete the picture by considering other related architectures [4–6, 14].

Acknowledgment. Petar Vukmirović greatly helped us design the abstract notions related to A-formulas. Giles Reger patiently explained AVATAR and revealed some of its secrets. Simon Cruanes did the same regarding lightweight AVATAR. Simon Robillard, Andrei Voronkov, Uwe Waldmann, and Christoph Weidenbach discussed splitting with us. Haniel Barbosa, Pascal Fontaine, Andrew Reynolds, and Cesare Tinelli explained some fine points of SMT. Natarajan Shankar pointed us to his work on the Shostak procedure. Ahmed Bhayat, Mark Summerfield, Dmitriy Traytel, Petar Vukmirović, and the anonymous reviewers suggested textual improvements. We thank them all.

This research has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 713999, Matryoshka). The research has also received funding from the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO) under the Vidi program (project No. 016.Vidi.189.037, Lean Forward).

## References



# **Integer Induction in Saturation**

Petra Hozzová<sup>1</sup>, Laura Kovács<sup>1</sup>, and Andrei Voronkov<sup>2,3</sup>

<sup>1</sup> TU Wien, Vienna, Austria
<sup>2</sup> University of Manchester, Manchester, UK
<sup>3</sup> EasyChair, Manchester, UK

{petra.hozzova, laura.kovacs}@tuwien.ac.at, andrei@voronkov.com

**Abstract.** Integers are ubiquitous in programming and therefore also in applications of program analysis and verification. Such applications often require some sort of inductive reasoning. In this paper we analyze the challenge of automating inductive reasoning with integers. We introduce inference rules for integer induction within the saturation framework of first-order theorem proving. We implemented these rules in the theorem prover Vampire and evaluated our work against other state-of-the-art theorem provers. Our results demonstrate the strength of our approach by solving new problems coming from program analysis and mathematical properties of integers.

#### **1 Introduction**

Integers are among the most commonly used data types in imperative and functional programs. For example, iterating over arrays in imperative programs or recursively computing sums in functional programs involves integer-valued program variables, as illustrated in Figure 1. While for many uses of integers in programming we only need to consider non-negative integers, there are also applications where arbitrary (also negative) integers are essential, for example, reasoning about memory. To formally prove functional correctness of such and similar programs, reasoning about integers is indispensable, but so is handling some sort of induction over integers. In this paper we address these two reasoning challenges and fully automate inductive reasoning with integers within saturation-based theorem proving.

Induction in saturation-based theorem proving is an exciting new direction in the automation of induction, recently introduced in [5, 10, 16]. These works focused on induction over inductively defined data types, also called algebraic data types [12], such as natural numbers or lists. However, automating *integer induction*, that is, induction on integers, has not yet been addressed sufficiently.

While natural numbers have a well-founded order, and induction over this order is very useful in automated inductive theorem proving, the standard order on integers is not well-founded and hence cannot be directly used as the induction ordering. In this paper we use the observation that the standard ordering < is well-founded on every set of integers having a lower bound b and, likewise, that its inverse > is well-founded on every set of integers having an upper bound b. This gives us two induction rules on such integer subsets: induction with base case b using <, to prove that a property holds for all integers ≥ b, and induction with base case b using >, to prove that a property holds for all integers ≤ b. We call these rules *upward, respectively downward, induction rules with symbolic bounds*. We also consider two variations of these rules over integer intervals and refer to them as *interval upward, respectively downward, induction rules with symbolic bounds*.

For natural numbers, 0 is an obvious base case candidate, and it also turns out to be successful in theorem proving practice. In this paper we give some natural problems for which neither 0 nor any other concrete integer is a good base case. Our paper focuses on the following three issues:


This paper is organized as follows. In Section 2 we illustrate our approach by considering properties of the functional and imperative programs of Figure 1. Then in Section 3 we define four induction rules over integers, called *(interval) downward, respectively upward, induction rules with symbolic bounds*, and prove their soundness. Section 4 introduces an extension of superposition calculus by our new integer induction rules. We demonstrate that, using this extension, superposition provers can prove integer properties similarly to how humans would do. This extension is especially successful when used together with the AVATAR architecture [19], since AVATAR helps in reasoning efficiently using constraints coming out of the integer induction rules.

We implemented our work in the Vampire theorem prover [13] and compare our implementation with other relevant provers, including Vampire without integer induction (Section 5). Our experiments show that integer induction can solve many new problems that could not so far be solved by any prover. For example, 75 problems coming from program analysis and/or mathematical integer properties could be solved only by Vampire with the new induction rules.

*Contributions.* This paper makes the following contributions:


**fun** sum(n, m) = **if** n = m **then** n **else** n + sum(n + 1, m);
**assert** ∀n, m ∈ ℤ. (n ≤ m → 2 · sum(n, m) = m · (m + 1) − n · (n − 1))

(a) Sum of integers from [n, m].

**assume** 0 ≤ pos < A.size
i := pos;
**while** i + 1 < A.size **do**
A[i + 1] := A[i];
i := i + 1;
**inv** ∀j ∈ ℤ. (pos ≤ j < i → val<sub>A</sub>(j + 1) = val<sub>A</sub>(j))
**end**
**assert** ∀j ∈ ℤ. (pos ≤ j < A.size → val<sub>A</sub>(j) = val<sub>A</sub>(pos))

(b) Array initialization, with val<sub>A</sub>(j) denoting A[j].

**Fig. 1.** Motivating examples for inductive reasoning with integers.

## **2 Motivating Examples**

#### **2.1 Preliminaries**

We assume familiarity with standard many-sorted first-order logic with equality. For details we refer to [13]. Throughout this paper we denote variables by x, y, e, j, n, m, constants by c, c′, and Skolem constants by σ, all possibly with indices. We denote terms by t, literals by L, formulas by F, and clauses by C. We denote the equality predicate by = and write t<sub>1</sub> ≠ t<sub>2</sub> for the literal ¬(t<sub>1</sub> = t<sub>2</sub>).

We will focus on integer induction. To this end, we assume a distinguished *integer sort*, denoted by ℤ. When we use standard integer predicates <, ≤, >, ≥, functions +, −, ..., and constants 0, 1, 2, ..., we assume that they denote the corresponding interpreted integer predicates and functions with their standard interpretations. All other symbols are uninterpreted. We write quantifiers like ∀x ∈ ℤ to denote that x has the integer sort.

In what follows, we will sometimes write "this problem requires integer induction". This should not be regarded as a formal statement: this property is not easy to formalize in general and it is possible that some of these problems can be proved by certain combinations of decision procedures, first-order theorem proving with uninterpreted functions, and axiomatization of interpreted functions on integers. However, when we make such statements, one can see that these problems have relatively simple proofs involving induction and cannot be proved by existing provers without induction.

#### **2.2 Examples**

To illustrate problems arising in automating integer induction, let us consider the programs of Figure 1. Properties of both programs are specified using assertions expressed in first-order logic, with pre- and post-conditions specified by the keywords **assume** and **assert**, respectively.

*Functional programs.* The ML-style functional program of Figure 1(a) computes the sum sum(n, m) of the integers in the interval [n, m], that is, ∑<sub>i=n</sub><sup>m</sup> i, where m ≥ n. The function definition uses the following axioms of sum:

$$\forall n \in \mathbb{Z}. (\text{sum}(n, n) = n); \tag{1}$$

$$\forall n, m \in \mathbb{Z}. (n \neq m \to \texttt{sum}(n, m) = n + \texttt{sum}(n+1, m)). \tag{2}$$

We should prove the assertion

$$\forall n, m \in \mathbb{Z}. (n \le m \to 2 \cdot \mathfrak{sum}(n, m) = m \cdot (m + 1) - n \cdot (n - 1)). \tag{3}$$
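Before turning to the proof, assertion (3) can be sanity-checked numerically by implementing sum directly from axioms (1) and (2); a quick sketch:

```python
# sum is defined recursively exactly as in axioms (1) and (2).
def sum_(n, m):
    return n if n == m else n + sum_(n + 1, m)

# check assertion (3) on a small range of integer pairs with n <= m
for n in range(-5, 6):
    for m in range(n, 6):
        assert 2 * sum_(n, m) == m * (m + 1) - n * (n - 1)
```

Of course, such testing is no substitute for the inductive proof discussed next; it merely confirms the identity on sampled inputs, including negative bounds.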

Formally proving (3) requires inductive reasoning with both integers and quantifiers. Let F[x] be a formula with one or more occurrences of an integer variable x and b an integer term not containing x. Consider the following formula:

$$F[b] \land \forall x \in \mathbb{Z}. (x \le b \land F[x] \to F[x-1]) \to \forall x \in \mathbb{Z}. (x \le b \to F[x]). \tag{4}$$

This formula is valid. It is similar to standard induction on natural numbers, yet with two essential differences. First, we use x − 1 instead of x + 1, and second, we use the term b where standard induction would use 0. Note that b does not have to be a concrete integer; it can be any term. In the sequel we will refer to such terms b used in induction rules as *symbolic bounds*.

For proving (3) using a theorem prover, we first negate and skolemize (3), obtaining the following formula, where $\sigma_n, \sigma_m$ are fresh skolem constants:

$$\sigma_n \le \sigma_m \land 2 \cdot \texttt{sum}(\sigma_n, \sigma_m) \ne \sigma_m \cdot (\sigma_m + 1) - \sigma_n \cdot (\sigma_n - 1) \tag{5}$$

Modern theorem provers implementing linear integer arithmetic and quantifiers can prove unsatisfiability of (1), (2) and (5) in a relatively straightforward way if we also add an instance of induction rule (4) with

$$\begin{array}{l} F[x] \stackrel{\text{def}}{=} 2 \cdot \texttt{sum}(x, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - x \cdot (x - 1); \\ b \stackrel{\text{def}}{=} \sigma_m. \end{array}$$

Here and in the sequel $\stackrel{\text{def}}{=}$ means "equal by definition" or "defined as". If we want to automate this kind of reasoning, the main question is how to find the corresponding instance of induction rule (4), that is, how to find the induction formula F[x] and the (symbolic) bound b.
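To illustrate, in a toy string-level way of our own (a real prover manipulates term structure, not strings), how such an instance can be picked: generalize the skolem constant $\sigma_n$ of the negated goal (5) to the induction variable x and flip the negated equality.

```python
# Toy illustration: obtain the induction formula F[x] of the instance
# above by generalizing sigma_n in the negated goal (5) to x.
# This is a string-level sketch only.
goal = "2 * sum(sigma_n, sigma_m) != sigma_m * (sigma_m + 1) - sigma_n * (sigma_n - 1)"
F_x = goal.replace("sigma_n", "x").replace("!=", "=")
assert F_x == "2 * sum(x, sigma_m) = sigma_m * (sigma_m + 1) - x * (x - 1)"
```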

*Imperative programs.* The C-style imperative program of Figure 1(b) initializes an integer-valued array A starting at the index *pos*. We should prove the assertion stating that all array elements at indices greater than or equal to *pos* are equal to each other. Proving such assertions typically requires loop invariants "summarizing" the loop behavior. One such invariant I is shown in the loop after the keyword **inv**. This invariant I could be derived by existing approaches to invariant generation [8, 9].

The assertion of Figure 1(b) is then proved using I, by establishing that the post-condition

$$\forall j \in \mathbb{Z}. (pos \le j < \texttt{A.size} \to \texttt{val}_{\texttt{A}}(j) = \texttt{val}_{\texttt{A}}(pos)) \tag{6}$$

is a logical consequence of the invariant I and the negation of the loop condition:

$$\begin{array}{l} \forall j \in \mathbb{Z}. (pos \le j < i \to \texttt{val}_{\texttt{A}}(j+1) = \texttt{val}_{\texttt{A}}(j)); \\ \neg (i+1 < \texttt{A.size}). \end{array} \tag{7}$$

Interestingly, modern theorem provers cannot perform such proofs. Similar to the first example, we can use an induction rule for integers formulated as follows:

$$\begin{array}{c} \left( F[b\_1] \land \forall x \in \mathbb{Z}. (b\_1 \le x < b\_2 \land F[x] \to F[x+1]) \right) \\ \to \forall x \in \mathbb{Z}. (b\_1 \le x \le b\_2 \to F[x]). \end{array} \tag{8}$$

If we add an instance of this rule defined as follows:

$$\begin{array}{l} F[x] \stackrel{\text{def}}{=} \texttt{val}_{\texttt{A}}(x) = \texttt{val}_{\texttt{A}}(pos); \\ b_1 \stackrel{\text{def}}{=} pos; \\ b_2 \stackrel{\text{def}}{=} \texttt{A.size} - 1, \end{array}$$

then state-of-the-art theorem provers can easily prove that (6) is a logical consequence of (7) and the corresponding instance of (8). For example, Cvc4 [1], Z3 [6] and Vampire prove such an instance in essentially no time. However, similarly to the example of Figure 1(a), in order to find such proofs automatically using the induction rule (8), we need to be able to discover, during proof search, the induction formula F[x] and the symbolic bounds $b_1, b_2$. In what follows, we describe our solution to automating this discovery by integrating integer induction within saturation-based theorem proving.
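Since Figure 1(b) itself is not reproduced here, the following is a hypothetical Python model of such an initialization loop: each iteration copies A[i] into A[i+1], so a property in the spirit of invariant (7) holds after every iteration and post-condition (6) holds on exit.

```python
# Hypothetical executable model of an initialization loop in the spirit
# of Figure 1(b): propagate A[pos] to all later positions.
def propagate(A, pos):
    i = pos
    while i + 1 < len(A):
        A[i + 1] = A[i]  # maintains: A[j+1] == A[j] for pos <= j < i
        i += 1
    return A

A = propagate([7, 3, 9, 4, 1], 2)
assert all(A[j] == A[2] for j in range(2, len(A)))  # post-condition (6)
```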

#### **3 Integer Induction**

In this section we define four induction rules, or induction schemas, on integers. Two of them were already considered in Section 2 – namely (4) and (8).

**Definition 1 (Downward/Upward Induction).** A *downward, respectively upward, induction axiom with symbolic bounds* is any formula of the form

$$\begin{array}{l} F[b] \wedge \forall x. (x \le b \wedge F[x] \to F[x-1]) \to \forall x. (x \le b \to F[x]); & (downward) \\ F[b] \wedge \forall x. (x \ge b \wedge F[x] \to F[x+1]) \to \forall x. (x \ge b \to F[x]), & (upward) \end{array}$$

respectively, where F[x] is a formula with one or more occurrences of an integer variable x and b is an integer term not containing x. 

Note that (4) is a downward induction axiom with symbolic bounds.

**Definition 2 (Interval Downward/Upward Induction).** An *interval downward, respectively upward, induction axiom with symbolic bounds* is any formula of the form

$$\begin{array}{ll} F[b_2] \land \forall x.(b_1 < x \le b_2 \land F[x] \to F[x-1]) \to \forall x.(b_1 \le x \le b_2 \to F[x]); & (down.) \\ F[b_1] \land \forall x.(b_1 \le x < b_2 \land F[x] \to F[x+1]) \to \forall x.(b_1 \le x \le b_2 \to F[x]), & (up.) \end{array}$$

respectively, where F[x] is a formula with one or more occurrences of an integer variable x and $b_1, b_2$ are integer terms not containing x. 

Note that (8) is an interval upward induction axiom with symbolic bounds. The main motivation for interval induction rules is their utility in reasoning about loops, as illustrated by the example of Figure 1(b). While interval induction can be captured by induction with one bound, doing so would require additional case analysis, which is not efficient in saturation-based proving practice.

In the sequel, we will refer to the integer terms $b, b_1, b_2$ from Definitions 1-2 as *symbolic bounds* and to the formulas F[x] from the induction axioms of Definitions 1-2 as *induction formulas*.

**Definition 3 (Downward/Upward Induction Rules).** The *downward (respectively, upward) induction rule with symbolic bounds*, or simply *downward (respectively, upward) induction rule* is the inference rule whose instances are all downward (respectively, upward) induction axioms with symbolic bounds.

Likewise, the *interval downward (respectively, upward) induction rule with symbolic bounds*, or simply *interval downward (respectively, upward) induction rule* is the inference rule whose instances are all interval downward (respectively, upward) induction axioms with symbolic bounds. 

It is easy to see that the following theorem holds.

**Theorem 1 (Soundness).** *The (interval) downward/upward induction rules of Definition 3 are sound, that is, all corresponding induction axioms from Definitions 1-2 are valid.* 
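For intuition, the validity of, e.g., the downward axiom can be seen via a greatest-counterexample argument (our sketch, not necessarily the authors' proof):

```latex
% Sketch for the downward axiom of Definition 1. Assume F[b] and
%   \forall x.(x \le b \land F[x] \to F[x-1]),
% and suppose \neg F[c] for some integer c \le b. The set
%   S = \{ x \mid c \le x \le b,\ \neg F[x] \}
% is finite and nonempty (c \in S), so it has a greatest element d.
% Since F[b] holds, d < b, hence d+1 \le b and F[d+1] holds.
% The step hypothesis at x := d+1 yields F[(d+1)-1], i.e. F[d],
% contradicting d \in S. Thus \forall x.(x \le b \to F[x]).
```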

#### **4 Integer Induction in Saturation-Based Proof Search**

Our next aim is to define analogues of the induction rules introduced in Section 3 that can be used in superposition theorem provers and their saturation algorithms. For a general discussion of superposition and saturation we refer to [13]. In this section we use $\square$ to denote the empty clause and write CNF(F) to mean (any) clausal normal form of a formula F. We refer to the set of clauses on which a saturation algorithm operates as the *search space*.

The most general way to introduce our new induction rules at the calculus level is to add clausal forms of our new induction axioms to the search space. That is, for every induction axiom F from Section 3, we add the rule

$$\frac{}{\text{CNF}(F)}\,.$$

However, we cannot efficiently implement such a calculus, as any formula with one variable can be used as an induction formula. We will therefore introduce different, more specialized rules, which still correspond to the previously defined induction rules. The new rules use variations of the following three ideas:

1. use literals occurring in the search space to construct induction formulas;
2. resolve the conclusion of an induction rule against the literals from which the induction formula was constructed;
3. apply induction only when a suitable bound literal is present in the search space.

The first two ideas were already used in the first papers underlying our approach to induction in saturation theorem proving [10, 16]. For example, they can be implemented by using only induction formulas that are obtained from ground literals L[t] in the search space, where t is a ground term. The corresponding induction formula will be ¬L[x]. The idea is that, when we prove the induction formula, ¬L[x] will be resolved against L[t].

The third idea is new. Note that, if we use the first two ideas and the upward induction rule, instead of ¬L[x] we will derive b ≤ x → ¬L[x]. When we resolve this against L[t], we obtain the clause ¬(b ≤ t). However, if we have previously derived b ≤ t, we can also resolve away ¬(b ≤ t). This gives us the idea to apply the upward induction rules only when we have b ≤ t.<sup>4</sup>

Based on the three ideas above, we introduce the following four induction rules on clauses. In these rules, t is a ground term, b is a constant, and L[x] is a literal containing at least one occurrence of a variable x and no other variables. The rules depend on which of the comparisons t ≥ b, t > b, t ≤ b and t < b already occur in the current search space:

$$\frac{\neg L[t] \lor C \qquad t \ge b}{\text{CNF}\big(\left(L[b] \land \forall x.(x \ge b \land L[x] \to L[x+1])\right) \to \forall y.(y \ge b \to L[y])\big)} \ (\text{IntInd}_{\ge})$$

$$\frac{\neg L[t] \lor C \qquad t > b}{\text{CNF}\big(\left(L[b] \land \forall x.(x \ge b \land L[x] \to L[x+1])\right) \to \forall y.(y > b \to L[y])\big)} \ (\text{IntInd}_{>})$$

$$\frac{\neg L[t] \lor C \qquad t \le b}{\text{CNF}\big(\left(L[b] \land \forall x.(x \le b \land L[x] \to L[x-1])\right) \to \forall y.(y \le b \to L[y])\big)} \ (\text{IntInd}_{\le})$$

$$\frac{\neg L[t] \lor C \qquad t < b}{\text{CNF}\big(\left(L[b] \land \forall x.(x \le b \land L[x] \to L[x-1])\right) \to \forall y.(y < b \to L[y])\big)} \ (\text{IntInd}_{<})$$

Note that IntInd≥ and IntInd> are upward induction rules, whereas IntInd≤ and IntInd< are downward induction rules. One can also introduce non-ground analogues of these rules, but we do not consider them in this paper.
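The conclusions of these rules are ordinary first-order formulas, so their construction is mechanical. Below is a toy Python sketch of our own (the name int_ind_geq is made up; a real prover builds terms, not strings) producing the pre-CNF conclusion of the upward rule IntInd≥:

```python
# Toy construction of the (pre-CNF) conclusion of IntInd>=:
# ((L[b] & forall x.(x >= b & L[x] -> L[x+1])) -> forall y.(y >= b -> L[y]))
# L is given as a function mapping an argument string to a literal string.
def int_ind_geq(L, b):
    step = f"forall x. ((x >= {b} & {L('x')}) -> {L('x + 1')})"
    return f"(({L(b)} & {step}) -> forall y. (y >= {b} -> {L('y')}))"

L = lambda s: f"val({s}) = val(pos)"
formula = int_ind_geq(L, "pos")
assert formula.startswith("((val(pos) = val(pos) &")
```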

<sup>4</sup> Using the AVATAR architecture [19], we can easily obtain valid literals b ≤ t.

Similarly to the above rules on the clausal level, we also introduce interval upward/downward induction rules on clauses, to be used in saturation algorithms for the superposition calculus. Since these rules are similar to each other, here we only define the rule IntInd[≥] for interval upward induction. For a ground term t, constants $b_1, b_2$, and a literal L[x] containing at least one occurrence of a variable x and no other variables, the interval upward induction rule on clauses is:

$$\frac{\neg L[t] \lor C \qquad t \ge b_1 \qquad t \le b_2}{\text{CNF}\big(\left(L[b_1] \land \forall x.(b_1 \le x < b_2 \land L[x] \to L[x+1])\right) \to \forall y.(b_1 \le y \le b_2 \to L[y])\big)} \ (\text{IntInd}_{[\ge]})$$

In view of Theorem 1, all induction rules of Section 3 are sound. Assuming that our CNF function preserves satisfiability, we conclude that all our induction rules IntInd≥, IntInd>, IntInd≤, IntInd< and IntInd[≥] on the clausal level are sound.

**Theorem 2 (Soundness).** *For every satisfiability-preserving CNF function, the induction rules IntInd≥, IntInd>, IntInd≤, IntInd< and IntInd[≥] on clauses are sound.* 

*Example 1.* To illustrate again how the choice of induction formulas allows us to have shorter clauses, consider IntInd≤. The CNF in its conclusion consists of three clauses:

$$\begin{array}{l} \neg L[b] \lor \sigma \le b \lor \neg(y \le b) \lor L[y] \\ \neg L[b] \lor L[\sigma] \lor \neg(y \le b) \lor L[y] \\ \neg L[b] \lor \neg L[\sigma - 1] \lor \neg(y \le b) \lor L[y] \end{array} \tag{9}$$

These clauses can be resolved against premises of IntInd≤, yielding the following clauses:

$$\begin{array}{l}\neg L[b] \lor \sigma \le b \lor C\\\neg L[b] \lor L[\sigma] \lor C\\\neg L[b] \lor \neg L[\sigma-1] \lor C\end{array} \tag{10}$$

They have an especially simple form when C is the empty clause $\square$. In this case we have three clauses:

$$\begin{array}{l}\neg L[b] \lor \sigma \le b\\\neg L[b] \lor L[\sigma] \\\neg L[b] \lor \neg L[\sigma-1] \end{array} \tag{11}$$

which subsume the original three longer clauses and are ground. Since they are ground, they can be handled efficiently by AVATAR. 

*Example 2.* Let us now demonstrate how the downward induction rule IntInd≤ works for proving the inductive property (3) from our motivating example of Figure 1(a). We use literals from (5) as the premises of the IntInd≤ rule. The corresponding instance of the downward induction rule is defined by

$$\begin{aligned} b &\stackrel{\text{def}}{=} \sigma_m; \\ t &\stackrel{\text{def}}{=} \sigma_n; \\ L[x] &\stackrel{\text{def}}{=} 2 \cdot \texttt{sum}(x, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - x \cdot (x - 1). \end{aligned}$$

This instance of IntInd<sup>≤</sup> is:

$$\frac{2 \cdot \texttt{sum}(\sigma_n, \sigma_m) \ne \sigma_m \cdot (\sigma_m + 1) - \sigma_n \cdot (\sigma_n - 1) \qquad \sigma_n \le \sigma_m}{\text{CNF}\left(\begin{array}{l} \Big(2 \cdot \texttt{sum}(\sigma_m, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - \sigma_m \cdot (\sigma_m - 1) \\ \quad \land\ \forall x. \big(x \le \sigma_m \land 2 \cdot \texttt{sum}(x, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - x \cdot (x - 1) \\ \qquad \to 2 \cdot \texttt{sum}(x - 1, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - (x - 1) \cdot ((x - 1) - 1)\big)\Big) \\ \to \forall y. \big(y \le \sigma_m \to 2 \cdot \texttt{sum}(y, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - y \cdot (y - 1)\big) \end{array}\right)} \ (\text{IntInd}_{\le})$$

This single instance of the induction rule does the magic. By adding its conclusion to the search space we can obtain a contradiction in a few steps by applying a few superposition rules and using ground reasoning in linear integer arithmetic with uninterpreted functions (as evidenced by the results for the first problem subset, *x all* of *sum*, in Table 3).

We finally note that functional correctness of Figure 1(b) is proved by the interval upward induction rule IntInd[≥], in a similar way as above (and as evidenced by the results of Table 3 for *declared unint ax-fin conj-fin* in *val*). 

What we find especially interesting in Example 2 is that the induction axiom used in it (and discovered by our implementation of induction in Vampire) uses the induction argument that would probably be used by a majority of humans who would try to argue why the program property holds.

#### **5 Implementation and Experiments**

#### **5.1 Implementation**

We implemented our integer induction rules IntInd≥, IntInd>, IntInd≤, IntInd< as well as IntInd[≥] and the other corresponding interval induction rules in Vampire. Further, we also implemented a more general induction rule IntInd that does not require bounds to be in the search space and uses 0 as the lower or the upper bound. Our implementation in Vampire, consisting of approximately 1,200 lines of new C++ code, is available at https://github.com/vprover/vampire. The size of this additional code is relatively small because Vampire has libraries for indexing and chaining inference rules that could be used off the shelf.

Our (interval) downward/upward induction rules described in Section 4 can be applied when either (i) the comparison literal (e.g., t ≥ b for the IntInd≥ rule) is selected and the corresponding clause ¬L[t] ∨ C was already selected as an induction candidate before, or (ii) ¬L[t] ∨ C is selected as an induction candidate and the corresponding comparison literal was already selected before. To implement these rules efficiently, we need to be able to quickly retrieve comparison literals and literals selected for induction. To this end, we extended the indexing mechanism of Vampire to index such literals. We do not apply induction when the induction formula L[x] is a comparison with x as a top-level argument, for example x ≤ t, and we allow applying induction to all other induction formulas deemed suitable by other user-specified options.

370 P. Hozzová, L. Kovács and A. Voronkov

```
assume e ≥ 1
fun power(x, 1) = x
   | power(x, e) = x · power(x, e − 1);
assert ∀x, y ∈ Z.(power(x · y, e) = power(x, e) · power(y, e))
```
**Fig. 2.** ML-like functional program computing integer powers for positive exponents.

Our (interval) downward/upward induction rules in Vampire are enabled by the new option --induction int. The options --int_induction_interval infinite and --int_induction_interval finite limit the enabled rules to downward/upward only, and to interval downward/upward only, respectively. Further, --int_induction_default_bound on enables the more general rule which does not require bounds to be in the search space. Our new induction rules can also be controlled by other Vampire options for well-founded/structural induction, such as --induction_on_complex_terms on, which enables applying induction on any ground complex term. To improve Vampire's performance for integer induction, we combined our new induction rules with --induction_on_complex_terms on and also other options not specific to induction. We extended Vampire with a new mode scheduling various option configurations for integer induction, switched on by the option --mode portfolio --schedule integer_induction. Additionally, we introduced the option --schedule induction, which uses either the integer induction configurations of --schedule integer_induction, or structural induction configurations, or both, depending on the data types used in the problem/property to be proved.

#### **5.2 Benchmarks**

We used two sets of examples: (i) benchmark sets LIA and UFLIA from the SMT-LIB collection [2], consisting of, respectively, 607 and 10,137 examples, and (ii) 120 new benchmarks similar to our motivating examples from Section 2.

To the best of our knowledge, the state-of-the-art systems implementing inductive reasoning have so far not considered inductive reasoning over integers, with two exceptions: [17], which mainly focuses on induction over inductively defined data types but also mentions induction on non-negative integers, and [11], which supports inductive reasoning using recursive function definitions without any special treatment of integers.

Since integer induction has not yet attracted enough attention in theorem proving, there is no significant collection of benchmarks for integer induction. To properly carry out experiments, we therefore created a set of *120 new benchmarks* based on variations of our motivating examples from Section 2 and on properties of computing integer powers. One example is the functional correctness of the


**Table 1.** Description of our benchmark set of 120 new examples.

program of Figure 2, which is formalized as follows:

$$\begin{aligned} \text{axioms:} \quad & \forall x \in \mathbb{Z}.(\texttt{power}(x, 1) = x) \\ & \forall x, e \in \mathbb{Z}.(2 \le e \to \texttt{power}(x, e) = x \cdot \texttt{power}(x, e - 1)) \\ \text{conjecture:} \quad & \forall x, y, e \in \mathbb{Z}.(1 \le e \to \texttt{power}(x \cdot y, e) = \texttt{power}(x, e) \cdot \texttt{power}(y, e)) \end{aligned} \tag{12}$$
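The power axioms and conjecture can be sanity-checked on small values (a finite check we added for illustration; the unbounded claim is exactly what requires upward induction on e):

```python
# Finite sanity check of the power benchmark: power per its axioms,
# conjecture checked on sampled values. Not a proof.
def power(x, e):
    return x if e == 1 else x * power(x, e - 1)

for x in range(-3, 4):
    for y in range(-3, 4):
        for e in range(1, 6):
            assert power(x * y, e) == power(x, e) * power(y, e)
```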

Our set of 120 new benchmarks is described in Table 1 and available online at:

#### https://github.com/vprover/inductive\_benchmarks

To confirm that our new benchmarks require the use of inductive reasoning, we tested them on the SMT solver Z3 [6], which does not support induction. Z3 could not solve any of the 120 problems from our benchmark set. Names of subsets of our new benchmarks are constructed by joining the variant tags described in Table 1. For example, problem (6) belongs to the category *declared unint ax-fin conj-fin* of the set *val*. The following benchmark:

$$\begin{aligned} \text{axiom:} \quad & \forall x \in \mathbb{Z}.(\texttt{val}(x) = \texttt{val}(x+1)) \\ \text{conjecture:} \quad & \forall x, y \in \mathbb{Z}.(\texttt{val}(x) = \texttt{val}(y)) \end{aligned} \tag{13}$$

belongs to *declared unint ax-all conj-all* of *val* and the below example is from *defined inter ax-geq conj-geq* of *val*:

$$\begin{aligned} \text{axioms:} \quad & \forall x \in \mathbb{Z}.(x \le 0 \to \texttt{val}(x) = 0) \\ & \forall x \in \mathbb{Z}.(0 < x \to \texttt{val}(x) = \texttt{val}(x - 1)) \\ \text{conjecture:} \quad & \forall x \in \mathbb{Z}.(0 \le x \to \texttt{val}(x) = \texttt{val}(0)) \end{aligned} \tag{14}$$
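Benchmark (14) can likewise be checked on finitely many values (our sanity check only; the statement for all x ≥ 0 is what integer induction establishes):

```python
# Finite sanity check of benchmark (14): val defined per its axioms.
def val(x):
    return 0 if x <= 0 else val(x - 1)

assert all(val(x) == val(0) for x in range(0, 100))
```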

While 9 of the benchmarks (all in *val*) use finite intervals in both the assertion and the invariant (*ax-fin conj-fin*), the remaining 111 benchmarks require inductive reasoning over infinite intervals.


**Table 2.** Comparison of solvers on SMT-LIB benchmarks.

#### **5.3 Experimental Setup**

We ran our experiments on computers with 32 cores (AMD Epyc 7502, 2.5 GHz) and 1 TB RAM. In all experiments we used the memory limit of 16 GB per problem. For the new benchmarks we used a 300 seconds time limit. For the experiments on the larger LIA and UFLIA sets we used a 10 seconds time limit.

In what follows, Vampire refers to the (default) version of Vampire, as in [10, 16]. By Vampire-I we denote our new version of Vampire, using integer induction rules (--induction int). Vampire-I\* refers to the portfolio mode of Vampire-I, scheduling various option configurations for integer induction (--mode portfolio --schedule induction).

For *experiments with the new benchmarks*, we note that Vampire without integer induction cannot solve any of the problems. In this set of experiments, we therefore compared Vampire-I to the provers Cvc4 [17] and Acl2 [11], which are, to the best of our knowledge, the only two automated solvers supporting inductive reasoning with integers in addition to reasoning with theories and quantifiers. For Cvc4, we used the *ig* configuration from [17]: --quant-ind --quant-cf --conjecture-gen --conjecture-gen-per-round=3 --full-saturate-quant. For Acl2, we used its default configuration and translated our new problem set into the functional program encoding syntax of Acl2. In the *experiments with the LIA and UFLIA benchmark sets of SMT-LIB*, we also used Z3 [6] in the default configuration.

We ran Cvc4, Z3, Vampire and Vampire-I on problems encoded in the SMT-LIB2 syntax [2].

#### **5.4 Experimental Results**

*SMT-LIB Benchmarks.* First, we evaluated the improvements of integer induction in Vampire-I when compared to Vampire, Cvc4 and Z3 on the LIA and UFLIA sets of SMT-LIB [2]. We aimed to verify that Vampire-I's performance does not deteriorate due to adding integer induction, check whether Vampire-I can solve problems that could not be solved automatically before, and to identify the best values for options related to integer induction. To this end, we picked five different strategies (e.g. using different saturation algorithms and selection functions) and used different combinations of induction options. Table 2 summarizes our results, showcasing that integer induction enabled Vampire-I to


**Table 3.** Experiments with our new benchmarks from Table 1.

solve over 100 new problems that Vampire could not solve before (last but one column of Table 2). Moreover, 45 of these problems were also new compared to Cvc4 and Z3 (last column of Table 2), which most likely means that no theorem prover was able to prove them before.

In problems solved using integer induction, the integer induction rules were applied often: at least one of the interval induction rules was used in nearly 99% of problems, while one of the induction rules with one bound was used in nearly all problems. The interval induction and induction rules were used on average 4559 and 1191 times, respectively. 89% of the proofs employed interval induction (67% upward, 29% downward), while 27% of the proofs used induction with one bound (22% upward, 8% downward). Additionally, over 64% of proofs only required one application of any induction rule.

*Experiments with 120 New Benchmarks.* Comparison results for Vampire-I, Acl2 and Cvc4 on our new benchmarks are displayed in Table 3, aggregated by benchmark subsets, as described in Table 1. We do not show Vampire in the table, since without integer induction it cannot solve any of the problems.

The results show that in some cases Acl2 can perform upward and downward induction on integers, but only when using interpreted constants as the base case (that is, it cannot handle symbolic bounds). Moreover, it can only do so if it also proves termination of the recursively defined function. It also has issues with reasoning about multiplication.

Cvc4 has limited support for integer induction: it can apply upward induction but only when the base case is an interpreted constant. Since some problems seem to require induction with symbolic bounds, Cvc4 is mostly able to either solve all problems in a subset, or none of them. The only exception is the subset *declared mixed ax-fin conj-fin*, in which Cvc4 solves one problem, which can be solved using upward induction with an interpreted constant as the base case.

Vampire-I\* does not have any conceptual problems with solving the benchmarks. However, since it uses axioms and inference rules rather than dedicated decision procedures for handling integers, it sometimes has issues with solving problems that contain large integer values. For example, for the infinite interval subset of the *val* benchmark set, the only problems Vampire-I\* did not solve were those containing the interpreted constant 100 or −100. Similarly, in the *power* benchmark set, the unsolved problems contained large numbers. Finally, in the *declared mixed ax-fin conj-fin* subset, the two problems Vampire-I\* did not solve also required more sophisticated arithmetic reasoning. However, the inability to deal efficiently with large numbers is not an intrinsic limitation of superposition theorem provers. Reasoning with quantifiers and theories is still in its infancy and major improvements are underway. For example, there are recent parallel developments in superposition and linear arithmetic [15] that should improve this kind of reasoning in Vampire.

#### **6 Related Work**

Previous works on automating induction mainly focused on inductive reasoning for inductively defined data types, for example in inductive theorem provers Acl2 [11], IsaPlanner [7], HipSpec [4], Zeno [18] and Imandra [14]; superposition theorem provers Zipperposition [5] and Vampire [16]; and the SMT solver Cvc4 [17]. While most of these solvers support reasoning with integers, only Acl2 and Cvc4 implement some form of induction over integers.

The Acl2 approach [11] generates induction schemas based on recursive function calls in the property to be proved. Hence, it can only use induction to prove properties of recursively defined functions. The SMT-based setting of Cvc4 [17], on the other hand, applies induction by inductive strengthening of SMT properties in combination with subgoal discovery. As noted in Section 5, Cvc4 is limited to upward induction with concrete base cases.

While downward integer induction can be considered a straightforward generalization of upward integer induction and does not by itself solve many more problems in our benchmark sets, symbolic bounds provide a very powerful generalization, as witnessed by our experimental results. In automated reasoning, the power provided by more general rules comes at the price of a potentially uncontrollable blowup of the search space. To harness this power, we defined (interval) upward/downward induction rules with symbolic bounds in the superposition calculus in such a way that in most cases they result in the addition of very simple clauses, which can be handled efficiently within the AVATAR architecture.

We believe that variants of our induction rules defined in Section 4 can also be successfully used by SMT solvers. The idea is to apply them, like we do, only when there is a suitable bound in the current candidate model. One can also combine this with the observation made in Example 1: one can resolve added induction formulas against literals already occurring in the search space to add only ground formulas.

The benchmark suite we propose and use in this paper is new and can be used to complement existing benchmarks: the TIP library [3] and the examples of [17]. Our 120 new examples are however more focused on integer properties, whereas [3, 17] contain a variety of problems mostly requiring induction over inductively defined types. Specifically, out of more than 500 inductive problems in TIP [3], only 3 use integers and no inductive data types. The examples from [17] contain 311 inductive benchmarks translated into three encodings, (i) using only inductive data types, (ii) using integers instead of natural numbers, but also other inductive data types (such as lists or trees), and (iii) using both integers and natural numbers to express the same properties, alongside other inductive data types. Problems from (iii) are also included in SMT-LIB [2]. Note that there is a substantial difference between our benchmarks and benchmarks from (ii). The latter mostly require inductive reasoning only for inductive data types (or no induction at all): they contain integers but only a few of them require inductive reasoning over integers, while most of our benchmarks require proper integer induction. For example, Vampire can solve 131 of 306 benchmarks in (ii) without using integer induction.

## **7 Conclusions**

We introduced new inference rules for automating inductive reasoning with integers within saturation-based theorem proving. Many problems in program analysis and mathematical problems of integers previously unsolvable by any theorem prover can now be solved completely automatically. We believe our results can progress automated program analysis and automation of mathematics, where integers are universally used.

**Acknowledgments.** We thank Márton Hajdú and Giles Reger for fruitful discussions. This work was partially funded by the ERC CoG ARTIST 101002685, the ERC StG SYMCAR 639270, the EPSRC grant EP/P03408X/1 and the FWF grant LogiCS W1255-N23.

## **References**



# Superposition with First-class Booleans and Inprocessing Clausification

Visa Nummelin<sup>1</sup>, Alexander Bentkamp<sup>1</sup>, Sophie Tourret<sup>2,3</sup>, and Petar Vukmirović<sup>1</sup>

<sup>1</sup> Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
visa.nummelin@vu.nl, a.bentkamp@vu.nl, p.vukmirovic@vu.nl
<sup>2</sup> Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
sophie.tourret@inria.fr
<sup>3</sup> Max-Planck-Institut für Informatik, Saarland Informatics Campus, Saarbrücken, Germany

Abstract. We present a complete superposition calculus for first-order logic with an interpreted Boolean type. Our motivation is to lay the foundation for refutationally complete calculi in more expressive logics with Booleans, such as higher-order logic, and to make superposition work efficiently on problems that would be obfuscated when using clausification as preprocessing. Working directly on formulas, our calculus avoids the costly axiomatic encoding of the theory of Booleans into first-order logic and offers various ways to interleave clausification with other derivation steps. We evaluate our calculus using the Zipperposition theorem prover, and observe that, with no tuning of parameters, our approach is on a par with the state-of-the-art approach.

# 1 Introduction

Superposition is a calculus for equational first-order logic that works on problems given in clausal normal form. Its immense success made preprocessing clausification a predominant mechanism in modern automatic theorem proving. However, this preprocessing is not without drawbacks. Clausification can transform simple problems, such as s → s where s is a large formula, in a way that hides its original simplicity from the superposition calculus. Ganzinger and Stuber's superposition-like calculus [13] operates on clauses that contain formulas as well as terms and replaces preprocessing clausification by inprocessing—meaning processing during the operation of the calculus itself. Inprocessing clausification allows superposition's powerful simplification engine to work on formulas. For example, unit equalities can rewrite formulas s and t in s ↔ t before clausification duplicates the occurrences into s → t and t → s. Whole formulas rather than simple literals can be removed by rules such as subsumption resolution [4].

Another issue with Boolean reasoning in the standard superposition calculus is that, in first-order logic, formulas cannot appear inside terms although this is often desirable for problems coming from software verifiers or proof assistants. Instead, authors of such tools need to resort to translations. Kotelnikov et al.

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 378–395, 2021. https://doi.org/10.1007/978-3-030-79876-5_22

studied the effects of these translations in detail. They showed that simple axioms such as the domain cardinality axiom for Booleans (∀(x : o). x ≈ ⊤ ∨ x ≈ ⊥) can severely slow down superposition provers. To support more efficient reasoning on problems with first-class Booleans, they describe the FOOL logic, which admits functions that take arguments of Boolean type and quantification over Booleans. They further describe two approaches to reasoning in FOOL: the first [17] requires an additional rule in the superposition calculus, whereas the second [16] is entirely based on preprocessing.

Our calculus combines complementary advantages of Ganzinger and Stuber's and of Kotelnikov et al.'s work. Following Kotelnikov et al., our logic (Sect. 2) is similar to FOOL and supports nesting formulas inside terms, as well as quantifying over Booleans. Following Ganzinger and Stuber, our calculus (Sect. 3) reasons with formulas and supports inprocessing clausification.

Our calculus also extends the two approaches. To reduce the number of possible inferences, we generalize Ganzinger and Stuber's Boolean selection functions, which allow us to restrict the Boolean subterms in a clause on which inferences can be performed. The term order requirements of our calculus are less restrictive than Ganzinger and Stuber's. In addition to the lexicographic path order (LPO), we also support the Knuth-Bendix order (KBO) [15], which is known to work better with superposition in practice.

Our proof of refutational completeness (Sect. 4) lays the foundation for complete calculi in more complex logics with Booleans. Indeed, Bentkamp et al. [8] devised a refutationally complete calculus for higher-order logic based on our completeness theorem. Our theorem incorporates a powerful redundancy criterion that allows for a variety of inprocessing clausification methods (Sect. 5).

We implemented our approach in the Zipperposition theorem prover (Sect. 6) and evaluated it on thousands of problems targeting our logic, drawn from the TPTP, SMT-LIB, and Sledgehammer-generated benchmarks (Sect. 7). Without fine-tuning, our new calculus performs as well as known techniques. Exploring the strategic choices that our calculus opens up should lead to further performance improvements. In addition, we corroborate the claims of Ganzinger and Stuber concerning the applicability of formula-based superposition reasoning: we find a set of 17 TPTP problems (out of 1000 randomly selected) that Zipperposition can solve only using the techniques described in this paper. We refer to our technical report [25] for more details on our calculus and the full completeness proof.

#### 2 Logic

Our logic is a first-order logic with an interpreted Boolean type. It is essentially identical to the UF logic of SMT-LIB [5], including the Core theory, but without if-then-else and let expressions, which can be supported through simple translations. It also closely resembles Kotelnikov et al.'s FOOL [17], which additionally supports if-then-else and let expressions.

Our logic requires an interpreted Boolean type o and allows for an arbitrary number of uninterpreted types. The set of symbols must contain the logical symbols ⊤, ⊥ : o; ¬ : o → o; ∧, ∨, → : (o × o) → o; and the overloaded symbols ≈, ≉ : (τ × τ) → o for each type τ. The logical symbols are printed in bold to distinguish them from the notation used for clauses below. Throughout the paper, we write tuples (a₁, ..., aₙ) as āₙ or ā.

The set of *terms* is defined inductively as follows. Every variable is a term. If f : τ̄ₙ → υ is a symbol and t̄ₙ : τ̄ₙ is a tuple of terms, then the application f(t̄ₙ) (or simply f if n = 0) is a term of type υ. If x is a variable and t : o a Boolean term, then the quantified terms ∀x. t and ∃x. t are terms of Boolean type. We view quantified terms modulo α-renaming. A *formula* is a term of Boolean type.

The *root* of a term is f if the term is an application f(t̄ₙ); it is x if the term is a variable x; and it is ∀ or ∃ if the term is a quantified term ∀x. t or ∃x. t. A variable occurrence is *free* in a term if it is not bound by ∀ or ∃. A term is *ground* if it contains no free variables. Substitutions are defined as usual in first-order logic and rename quantified variables to avoid capture.

A literal s ≈̇ t is an equation s ≈ t or a disequation s ≉ t. Unlike terms constructed using the function symbols ≈ and ≉, literals are unoriented. A clause L₁ ∨ ··· ∨ Lₙ is a finite multiset of literals Lⱼ. The empty clause is written as ⊥. Terms t of Boolean type are not literals themselves. They must be encoded as t ≈ ⊤ and t ≈ ⊥, which we call *predicate literals*. Both are considered positive literals because they are equations, not disequations.

We have considered excluding negative literals s ≉ t by encoding them as (s ≈ t) ≈ ⊥, following Ganzinger and Stuber. However, this approach requires an additional term order condition to make the conclusion of equality factoring small enough, excluding KBO. To support both KBO and LPO, we allow negative literals. Regardless, our simplification mechanism will allow us to simplify negative literals of the form t ≉ ⊥ and t ≉ ⊤ into t ≈ ⊤ and t ≈ ⊥, respectively, thereby eliminating redundant representations of predicate literals.
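
As a minimal illustration, this normalization can be sketched in Python. The triple encoding of literals below is our own, not the paper's or Zipperposition's:

```python
# Literals encoded as triples (lhs, rel, rhs), with rel in {"≈", "≉"}.
# The simplification turns the redundant negative representations
# t ≉ ⊥ and t ≉ ⊤ into the positive predicate literals t ≈ ⊤ and t ≈ ⊥.
def normalize_literal(lit):
    lhs, rel, rhs = lit
    if rel == "≉" and rhs == "⊥":
        return (lhs, "≈", "⊤")
    if rel == "≉" and rhs == "⊤":
        return (lhs, "≈", "⊥")
    return lit                      # already in normal form

assert normalize_literal(("p(a)", "≉", "⊥")) == ("p(a)", "≈", "⊤")
```

Literals that are not predicate literals, such as s ≉ t for non-Boolean s and t, pass through unchanged.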

The semantics is a straightforward extension of standard first-order logic, only adding the interpretation of the Boolean type as a two-element domain, as in Kotelnikov et al.'s FOOL logic. Some of our calculus rules introduce Skolem symbols, which are intended to be interpreted as witnesses for existentially quantified terms. Still, our semantics treats them as uninterpreted symbols. To achieve a satisfiability-preserving calculus, we assume that these symbols do not occur in the input problem. More precisely, we inductively extend the signature of the input problem by a symbol sk<sub>∀ȳ.∃z.t</sub> : τ̄ → υ for each term of the form ∃z. t over the extended signature, where υ is the type of z and ȳ : τ̄ are the free variables occurring in ∃z. t, in order of first appearance.

#### 3 The Calculus

Following standard superposition, our calculus employs a term order and a literal selection function to restrict the search space. To accommodate quantified Boolean terms, we impose additional requirements on the term order. To support flexible reasoning with Boolean subterms, we introduce a Boolean subterm selection function in addition to the literal selection function.

Term Order The calculus is parameterized by a strict well-founded order ≻ on ground terms that fulfills: (O1) u ≻ ⊥ ≻ ⊤ for any term u that is not ⊤ or ⊥; (O2) ∀x. t ≻ {x ↦ u}t and ∃x. t ≻ {x ↦ u}t for any term u whose only Boolean subterms are ⊤ and ⊥; (O3) the subterm property; (O4) compatibility with contexts (not necessarily below ∀ and ∃); (O5) totality. The order is extended to literals, clauses, and nonground terms as usual [2]. The nonground order then also enjoys (O6) stability under grounding substitutions.

Ganzinger and Stuber's term order restrictions are similar but incompatible with KBO. Using an encoding of our terms into untyped first-order logic we describe how both LPO and the transfinite variant of KBO [19] can satisfy conditions (O1)–(O6).

Our encoding represents bound variables by De Bruijn indices, which become new constant symbols dbₙ for n ∈ ℕ. Quantifiers are represented by two new unary function symbols, also denoted by ∀ and ∃. All other symbols are simply identified with their untyped counterpart. Regardless of symbol precedence or symbol weights, KBO and LPO enjoy properties (O3)–(O6) when applied to the encoded terms. They are even compatible with contexts below quantifiers.

To satisfy (O1) and (O2), let the precedence for LPO be ⊤ < ⊥ < f < ∀ < ∃ < db₀ < db₁ < ··· where f is any other symbol. For KBO, we can use the same symbol precedence and a symbol weight function W that assigns each symbol an ordinal weight (of the form ωa + b with a, b ∈ ℕ), where W(⊤) = W(⊥) = 1, W(∀) = W(∃) = ω, and W(f) ∈ ℕ \ {0} for any other symbol f.
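
To illustrate the ordinal weights, the following Python sketch represents a weight ω·a + b as a pair (a, b) compared lexicographically. The tuple term encoding and the concrete finite weights are our own assumptions, chosen only to satisfy the constraints above; for the pair below, the weight comparison alone already decides the ordering in the spirit of condition (O2):

```python
# Terms are nested tuples: ("f", arg1, ...); constants take no arguments.
# Ordinal weights ω·a + b are pairs (a, b); Python's lexicographic tuple
# comparison matches ordinal comparison for this fragment.
WEIGHT = {"⊤": (0, 1), "⊥": (0, 1), "∀": (1, 0), "∃": (1, 0)}

def weight(term):
    head, *args = term
    a, b = WEIGHT.get(head, (0, 1))   # hypothetical finite weight 1 otherwise
    for arg in args:
        wa, wb = weight(arg)
        a, b = a + wa, b + wb
    return (a, b)

# ∀x. p(x), encoded with the De Bruijn constant db0, outweighs the
# instance {x ↦ c} p(x), because the quantifier contributes weight ω.
quantified = ("∀", ("p", ("db0",)))
instance = ("p", ("c",))
assert weight(quantified) > weight(instance)
```

Here weight(quantified) = ω + 2 = (1, 2) while weight(instance) = (0, 2), so the quantified term is heavier no matter how large the instantiating term's finite weight is.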

Selection and Eligibility Following an idea of Ganzinger and Stuber, we parameterize our calculus with two selection functions: one selecting literals and one selecting Boolean subterms.

Definition 1 (Selection functions). The calculus is parameterized by a literal selection function *FLSel* and a Boolean subterm selection function *FBSel*. The function *FLSel* maps each clause to a subset of its literals. The selection function *FBSel* maps each clause to a subset of its Boolean subterms. The literals *FLSel*(C) and the subterms *FBSel*(C) are *selected* in C. The following restrictions apply: (S1) A literal can only be selected if it is negative or of the form s ≈ ⊥. (S2) A Boolean subterm can only be selected if it is not , ⊥, or a variable. (S3) A Boolean subterm can only be selected if its occurrence is not below a quantifier. (S4) The topmost terms on either side of a positive literal cannot be selected.

The interplay of maximality w.r.t. term order, literal and Boolean selection functions gives rise to a new notion of eligibility:

Definition 2 (Eligibility). A literal L is (*strictly*) *eligible* w.r.t. a substitution σ in C if it is selected in C, or if there are no selected literals and no selected Boolean subterms in C and σL is (strictly) maximal in σC. The eligible subterms of a clause C w.r.t. a substitution σ are inductively defined as follows: (E1) Any selected subterm is eligible. (E2) If a literal s ≈̇ t with σs ⊀ σt is either eligible and negative or strictly eligible and positive, then s is eligible. (E3) If a subterm is eligible and its root is not ≈, ≉, ∀, or ∃, all of its direct subterms are also eligible. (E4) If a subterm is eligible and of the form s ≈ t or s ≉ t, then s is eligible if σs ⊀ σt, and t is eligible if σs ⊁ σt. The substitution σ is left implicit if it is the identity substitution.

The Core Inference Rules The following inference rules form our calculus:

$$\frac{D' \lor t \approx t' \qquad C[u]}{\sigma(D' \lor C[t'])}\,\text{Sup} \qquad \frac{C' \lor u' \approx v' \lor u \approx v}{\sigma(C' \lor v \not\approx v' \lor u \approx v')}\,\text{Factor} \qquad \frac{C' \lor u \not\approx u'}{\sigma C'}\,\text{Irrefl}$$

$$\frac{C' \lor s \approx t}{\sigma C'}\,\bot\text{Elim} \qquad \frac{C[u]}{\sigma C[t']}\,\text{BoolRw} \qquad \frac{C[\forall z.\, v]}{C[\{z \mapsto \mathsf{sk}_{\forall \bar{y}.\,\exists z.\,\neg v}(\bar{y})\}\,v]}\,\forall\text{Rw} \qquad \frac{C[\exists z.\, v]}{C[\{z \mapsto \mathsf{sk}_{\forall \bar{y}.\,\exists z.\,v}(\bar{y})\}\,v]}\,\exists\text{Rw}$$

$$\frac{C[u]}{C[\bot] \lor u \approx \top}\,\text{BoolHoist} \qquad \frac{C[s \approx t]}{C[\bot] \lor s \approx t}\,{\approx}\text{Hoist} \qquad \frac{C[s \not\approx t]}{C[\top] \lor s \approx t}\,{\not\approx}\text{Hoist}$$

$$\frac{C[\forall x.\, t]}{C[\bot] \lor \{x \mapsto y\}t \approx \top}\,\forall\text{Hoist} \qquad \frac{C[\exists x.\, t]}{C[\top] \lor \{x \mapsto y\}t \approx \bot}\,\exists\text{Hoist}$$

The rules are subject to the following side conditions:


Rationale for the Rules Our calculus is a graceful generalization of superposition: if the input clauses do not contain any Boolean terms, it coincides with standard superposition. In addition to the standard superposition rules Sup, Factor, and Irrefl, our calculus contains various rules to deal with Booleans. For each logical symbol and quantifier, we must consider the case where it is true and the case where it is false. Whenever possible, we prefer rules that rewrite the Boolean subterm in place (with names ending in Rw). When this cannot be done in a satisfiability-preserving way, we resort to rules hoisting the Boolean subterm into a dedicated literal (with names ending in Hoist). For terms rooted by an uninterpreted predicate, the rule BoolHoist only deals with the case that the term is false. If it is true, we rely on Sup to eventually rewrite it to ⊤.

*Example 3.* The clause a ∧ ¬a ≈ ⊤ can be refuted by the core inferences as follows. First we derive a ≈ ⊤ (displayed on the left) and then we use it to derive ⊥ (displayed on the right). In this and the following example, we assume eager selection of literals whenever the selection restrictions allow it.

The derivation illustrates how BoolHoist and Sup replace uninterpreted predicates by ⊤ and ⊥ to allow BoolRw to eliminate the surrounding logical symbols.

*Example 4.* The clause (∃x. ∀y. y ≉ x) ≈ ⊤ can be refuted as follows:

$$\frac{\dfrac{\dfrac{\dfrac{\dfrac{(\exists x.\,\forall y.\, y \not\approx x) \approx \top}{(\forall y.\, y \not\approx \mathsf{sk}_{\exists x.\,\forall y.\, y \not\approx x}) \approx \top}\,\exists\text{Rw}}{\bot \approx \top \lor (y' \not\approx \mathsf{sk}_{\exists x.\,\forall y.\, y \not\approx x}) \approx \top}\,\forall\text{Hoist}}{(y' \not\approx \mathsf{sk}_{\exists x.\,\forall y.\, y \not\approx x}) \approx \top}\,\bot\text{Elim}}{\bot \approx \top}\,\text{BoolRw}}{\bot}\,\bot\text{Elim}$$

Redundancy Criterion In standard superposition, a clause is defined as redundant if all of its ground instances follow from smaller ground instances of other clauses. We keep this definition, but use a nonstandard notion of ground instances, inspired by constraint superposition [23]. In our completeness proof, this new notion of ground instances ensures that ground instances of the conclusion of ∀Rw, ∃Rw, ∀Hoist, and ∃Hoist inferences are smaller than the corresponding instances of their premise by property (O2).

Definition 5 (Redundancy of clauses). The *ground instances* of a clause C are all ground clauses of the form γC where γ is a substitution such that for all variables x, the only Boolean subterms of γx are ⊤ and ⊥. A ground clause C is *redundant* w.r.t. a ground clause set N if there exist clauses C₁,...,Cₖ ∈ N such that C₁,...,Cₖ ⊨ C and C ≻ Cᵢ for all 1 ≤ i ≤ k. A nonground clause C is redundant w.r.t. clauses N if C is strictly subsumed by a clause in N or every ground instance of C is redundant w.r.t. ground instances of N.

In standard superposition, an inference is defined as redundant if all its ground instances are, and a ground inference is defined as redundant if its conclusion follows from other clauses smaller than the main premise. We keep this definition as well, but we use a nonstandard notion of ground instances for some of the Boolean rules. In our report, we define a slightly stronger variant of inference redundancy via an explicit ground calculus, but the following notion is also strong enough to justify the few prover optimizations based on inference redundancy we know from the literature (e.g., simultaneous superposition [7]).

Definition 6 (Redundancy of inferences). A *ground instance* of a ∀Rw, ∃Rw, ∀Hoist, or ∃Hoist inference is an inference obtained by applying a grounding substitution to premise and conclusion, regardless of whether the result is a valid ∀Rw, ∃Rw, ∀Hoist, or ∃Hoist inference. A *ground instance* of an inference ι of the other rules is an inference ι′ of the same rule such that the premises and conclusion of ι′ are ground instances of the respective premises and conclusion of ι. For ι′, we use selection functions that select the ground literals and Boolean subterms corresponding to the ones selected in the nonground premises. A ground inference with main premise C, side premises C₁,...,Cₙ, and conclusion D is *redundant* w.r.t. N if there exist clauses D₁,...,Dₖ ≺ C in N such that D₁,...,Dₖ, C₁,...,Cₙ ⊨ D. A nonground inference is redundant if all its ground instances are redundant.

A clause set N is *saturated* if every inference from N is redundant w.r.t. N.

Simplification Rules The redundancy criterion is a graceful generalization of the criterion of standard superposition. Thus, the standard simplification and deletion rules, such as deletion of trivial literals and clauses, subsumption, and demodulation, can be justified. Demodulation below quantifiers is justified if the term order is compatible with contexts below quantifiers.

Some calculus rules can act as simplifications. ⊥Elim can always be used as a simplification. Given a clause on which both QRw and QHoist apply, where Q ∈ {∀, ∃}, the clause can be replaced by the conclusions of these rules. If QRw does not apply because of condition 4 or 5, QHoist alone can be a simplification. Also justified by redundancy, the rules BoolHoist and QHoist can simultaneously replace all occurrences of the eligible subterm they act on. For example, applying ≈Hoist to p(x ≈ y) ≈ ⊤ ∨ q(x ≈ y) ≈ ⊥ yields p(⊥) ≈ ⊤ ∨ q(⊥) ≈ ⊥ ∨ x ≈ y.
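
The simultaneous replacement can be sketched as follows, replaying the ≈Hoist example above. The term and clause encodings are our own illustration:

```python
# Terms: nested tuples ("f", arg1, ...) or atomic strings.
def replace_all(term, u, t):
    """Simultaneously replace every occurrence of subterm u in term by t."""
    if term == u:
        return t
    if isinstance(term, tuple):
        return (term[0],) + tuple(replace_all(a, u, t) for a in term[1:])
    return term

# ≈Hoist applied simultaneously to p(x ≈ y) ≈ ⊤ ∨ q(x ≈ y) ≈ ⊥:
# replace every occurrence of (x ≈ y) by ⊥ and append the hoisted literal.
eq = ("≈", "x", "y")
clause = [(("p", eq), "⊤"), (("q", eq), "⊥")]   # literals as (lhs, rhs) pairs
hoisted = [(replace_all(s, eq, "⊥"), b) for (s, b) in clause] + [(eq, "⊤")]
# hoisted encodes p(⊥) ≈ ⊤ ∨ q(⊥) ≈ ⊥ ∨ (x ≈ y) ≈ ⊤
```

Applying the replacement once to all occurrences avoids generating an intermediate clause per occurrence.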

While experimenting with our implementation, we have observed that the following simplification rule from Vampire [18] can substantially shorten proofs:

$$\frac{s \not\approx t \lor C[s]}{s \not\approx t \lor C[t]}\,\text{LocalRw}$$

In this rule, we require s ≻ t.

Interpreting literals of the form s ≈ ⊤ as s ≉ ⊥ and s ≈ ⊥ as s ≉ ⊤, we can apply the rule even to these positive literals. This is especially convenient in combination with rules such as BoolHoist. Consider the clause C = p<sup>i</sup>(⊥) ≈ ⊥ ∨ q ≈ ⊥, assume no literal is selected, and let the Boolean selection function always select the subterm p(⊥). Applying BoolHoist to C we get p(⊥) ≈ ⊤ ∨ p<sup>i−1</sup>(⊥) ≈ ⊥ ∨ q ≈ ⊥. This can then be simplified to the tautological clause p(⊥) ≈ ⊤ ∨ p(⊥) ≈ ⊥ ∨ q ≈ ⊥ using i − 2 LocalRw steps. If we did not use LocalRw, BoolHoist would produce i − 2 intermediary clauses starting from C, none of which would be recognized as a tautology.

Many rules of our calculus replace subterms with ⊤ or ⊥. After this replacement, the resulting terms can be simplified using Boolean equivalences that specify the behavior of logical operations on ⊤ and ⊥. To this end, we use the rule BoolSimp [33], similar to simp of Leo-III [27, Sect. 4.2.1]:

$$\frac{C[s]}{C[t]}\,\text{BoolSimp}$$

This rule replaces s with t whenever s ≈ t is contained in a predefined set of tautological equations. In addition to all equations that Leo-III uses for simp, we also include more complex ones, such as (¬u → u) ≈ u and (u₁ → ··· → uₙ → v₁ ∨ ··· ∨ vₘ) ≈ ⊤ where uᵢ = vⱼ for some i and j. The exhaustive list is given in our technical report. Using BoolSimp and ⊥Elim, the twelve steps of Example 3 can be replaced by just two simplification steps.
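
A fixpoint simplifier in the style of BoolSimp can be sketched as follows. The term encoding and the tiny rule set are our own and cover only a small fraction of the equations listed in the report:

```python
# Terms: ("∧", s, t), ("∨", s, t), ("→", s, t), ("¬", s), "⊤", "⊥", or atoms.
def simp(t):
    """One bottom-up pass applying a few tautological equations."""
    if not isinstance(t, tuple):
        return t
    t = (t[0],) + tuple(simp(a) for a in t[1:])   # simplify arguments first
    if t == ("¬", "⊤"): return "⊥"
    if t == ("¬", "⊥"): return "⊤"
    if t[0] == "∧":
        _, s1, s2 = t
        if "⊥" in (s1, s2): return "⊥"
        if s1 == "⊤": return s2
        if s2 == "⊤": return s1
    if t[0] == "∨":
        _, s1, s2 = t
        if "⊤" in (s1, s2): return "⊤"
        if s1 == "⊥": return s2
        if s2 == "⊥": return s1
    if t[0] == "→" and t[1] == ("¬", t[2]):
        return t[2]                               # (¬u → u) ≈ u
    return t

def bool_simp(t):
    """Apply simp until a fixpoint is reached."""
    s = simp(t)
    while s != t:
        t, s = s, simp(s)
    return t
```

For example, bool_simp applied to the encoding of ¬⊥ ∧ b reduces it to b in a single pass.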

BoolSimp simplifies terms with logical symbols at their roots if one argument is either ⊤ or ⊥ or if two arguments are identical. Thus, after simplification, BoolRw applies only in two remaining cases: if all arguments of a logical symbol are distinct variables, or if the sides of a (dis)equation are different and unifiable. This observation can be used to streamline the implementation of BoolRw.

#### 4 Refutational Completeness

Our calculus is dynamically refutationally complete. All the rules that do not introduce Skolem symbols are also sound.

Theorem 7 (Completeness). *Let* S₀ *be an unsatisfiable set of clauses. Let* (Sᵢ)ᵢ *be a fair derivation—i.e., a derivation where* $\bigcup_{i=0}^{\infty}\bigcap_{j=i}^{\infty} S_j$ *is saturated. Then* ⊥ ∈ Sᵢ *for some* i*.*

We outline some key parts of the proof here and refer to our technical report [25] for the details. We first define a ground version of our calculus, with the redundancy criterion inherited in the standard way, and prove it complete. Devising suitable ground analogues of the rules ∀Rw and ∃Rw was difficult because the arguments of the Skolem terms depend on the variables occurring in the premise. Therefore, we parameterize the ground calculus by a function that provides ground Skolem terms in the ground versions of these rules. When lifting the completeness result to the nonground level, we instantiate this parameter with a specific function that allows us to lift the ∀Rw and ∃Rw inferences.

To prove the ground calculus complete, we employ the framework for reduction of counterexamples [3]. It requires us to construct an interpretation I given a saturated unsatisfiable clause set that does not contain ⊥. Then we must show that any counterexample—i.e., a clause that does not hold in I—can be reduced to a smaller (≺) counterexample by some inference.

The interpretation I is defined by a normalizing rewrite system as in the standard completeness proof of superposition. To ensure a correct interpretation of Booleans, we incrementally add Boolean rewrite rules along with the rules produced by clauses as usual. If a counterexample can be rewritten by a Boolean rule, we reduce it by a Rw or Hoist inference. If it can be rewritten by a rule produced by a clause, we reduce it by a Sup inference.

We derive the dynamic completeness of our nonground calculus using the saturation framework [35]. It gives us a nonground clause set N to work with. We then have to choose the parameters of our ground calculus such that all of its inferences from the grounding of N are redundant or liftable. We show that inferences rewriting below variables are redundant. Other inferences we show to be liftable—i.e., they are a ground instance of some inference from N.

## 5 Inprocessing Clausification Methods

Our calculus makes preprocessing clausification unnecessary: a problem specified by a formula f can be represented as a clause f ≈ ⊤. Our redundancy criterion allows us to add various sets of rules to steer the inprocessing clausification.

Without any additional rules, our core calculus rules perform all the necessary reasoning about formulas. We call this method *inner delayed clausification* because the calculus rules tend to operate on the inner Boolean subterms first.

The *outer delayed clausification* method adds the following rules to the calculus, which are guided by the outermost logical symbols. Let s and t be Boolean terms. Below, we let s<sup>+</sup> range over literals of the form s ≈ ⊤ and s ≉ ⊥, and s<sup>−</sup> over literals of the form s ≈ ⊥ and s ≉ ⊤.

$$\frac{s^+ \lor C}{oc(s,\, C)}\,{+}\text{OuterClaus} \qquad\qquad \frac{s^- \lor C}{oc(\neg s,\, C)}\,{-}\text{OuterClaus}$$

$$\frac{s \approx t \lor C}{s \approx \top \lor t \approx \bot \lor C \qquad s \approx \bot \lor t \approx \top \lor C}\,{\approx}\text{OuterClaus} \qquad \frac{s \not\approx t \lor C}{s \approx \top \lor t \approx \top \lor C \qquad s \approx \bot \lor t \approx \bot \lor C}\,{\not\approx}\text{OuterClaus}$$

The rules <sup>+</sup>OuterClaus and <sup>−</sup>OuterClaus are applicable to any term s whose root is a logical symbol, whereas the rules ≈OuterClaus and ≉OuterClaus are only applicable if neither s nor t is ⊤ or ⊥. Clearly, our redundancy criterion allows us to replace the premise of all OuterClaus rules by their conclusions. Nonetheless, the rules ≈OuterClaus and ≉OuterClaus are not used as simplification rules, since destructing equivalences disturbs the syntactic structure of the formulas, as noted by Ganzinger and Stuber [13]. The function *oc*(s, C) analyzes the shape of the formula s and distributes it over the clause C. For example, *oc*(s₁ → s₂, C) = {s₁ ≈ ⊥ ∨ s₂ ≈ ⊤ ∨ C}, and *oc*(¬(s₁ ∨ s₂), C) = {s₁ ≈ ⊥ ∨ C, s₂ ≈ ⊥ ∨ C}. This function also replaces quantified terms by either a fresh free variable or a Skolem term in the body of the quantified term, depending on the polarity. The full definition of *oc*(s, C) is specified in our technical report.
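
The shape analysis performed by *oc* can be sketched for a few connectives as follows. The encoding is our own, and quantifiers as well as most remaining cases are omitted; the full definition is in the technical report:

```python
# Formulas: ("→", s, t), ("∨", s, t), ("∧", s, t), ("¬", s), or atoms.
# Clauses: frozensets of literals; the literal (s, "⊤") stands for s ≈ ⊤
# and (s, "⊥") for s ≈ ⊥.  oc(s, C) distributes the formula s over C,
# returning a set of clauses.
def oc(s, C):
    if isinstance(s, tuple):
        if s[0] == "→":                          # s1 → s2
            return {C | {(s[1], "⊥"), (s[2], "⊤")}}
        if s[0] == "∨":                          # s1 ∨ s2
            return {C | {(s[1], "⊤"), (s[2], "⊤")}}
        if s[0] == "∧":                          # s1 ∧ s2: two clauses
            return {C | {(s[1], "⊤")}, C | {(s[2], "⊤")}}
        if s[0] == "¬" and isinstance(s[1], tuple) and s[1][0] == "∨":
            return {C | {(s[1][1], "⊥")}, C | {(s[1][2], "⊥")}}
        if s[0] == "¬" and isinstance(s[1], tuple) and s[1][0] == "¬":
            return oc(s[1][1], C)                # eliminate double negation
    return {C | {(s, "⊤")}}                      # atomic case: s ≈ ⊤
```

The two worked examples from the text come out directly: oc(s₁ → s₂, C) yields the single clause s₁ ≈ ⊥ ∨ s₂ ≈ ⊤ ∨ C, and oc(¬(s₁ ∨ s₂), C) yields the two clauses s₁ ≈ ⊥ ∨ C and s₂ ≈ ⊥ ∨ C.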

A third inprocessing clausification method is *immediate clausification*. It first preprocesses the input problem using a standard first-order clausification procedure such as Nonnengart and Weidenbach's [24]. Then, during proof search, when a clause C appears to which OuterClaus rules could be applied, we instead apply the standard clausification procedure to the formula ∀x̄. C (where x̄ are the free variables of C) and replace C with the clausification results. With this method, the formulas are clausified in one step, making intermediate clausification results inaccessible to the simplification machinery.

Renaming Common Formulas Following Tseitin [31], clausification procedures usually rename common formulas to prevent a possible combinatorial explosion caused by naive clausification. In our two delayed clausification methods, we realize this idea using the following rule:

$$\frac{C_1[\sigma_1 f] \qquad \cdots \qquad C_n[\sigma_n f]}{C_1[\sigma_1\, \mathsf{p}(\bar{x})] \quad \cdots \quad C_n[\sigma_n\, \mathsf{p}(\bar{x})] \quad R_1 \quad \cdots \quad R_m}\,\text{Rename}$$

Here, the formula f has a logical root, x̄ are the distinct free variables in f, p is a fresh symbol, each σᵢ is a substitution, and the clauses R₁,...,Rₘ are the result of simplifying the *definition clause* R = p(x̄) ≈ f as described below. The rule avoids an exponential explosion by collapsing the n positions in which the results of f's clausification would appear into a single position in R. Optimizations such as polarity-aware renaming [24, Sect. 4] also apply to Rename.

Several issues arise with Rename as an inprocessing rule. We need to ensure that f ≻ p(x̄) in R, since otherwise demodulation might reintroduce the formula f into the simplified clauses. This can be achieved by giving the fresh symbol p a precedence smaller than that of all symbols initially present in the problem (other than ⊤ and ⊥). To keep the precedence well founded, the precedence of p must be greater than that of symbols previously introduced by the calculus. For KBO, we additionally set the weight of p to the minimal possible weight.

For Rename to be used as a simplification rule, we need to ensure that the conclusions are smaller than the premises. This is trivially true for all clauses other than the clause R. For example, let Cᵢ = f ≈ ⊤ (with σᵢ the identity). Clearly, R is larger than Cᵢ. However, we can view the definition clause R as two clauses R<sup>+</sup> = p(x̄) ≈ ⊥ ∨ f ≈ ⊤ and R<sup>−</sup> = p(x̄) ≈ ⊤ ∨ f ≈ ⊥. Then, we can apply a single step of the OuterClaus rules to R<sup>+</sup> and R<sup>−</sup> (on their subformula f), which results in the clauses R₁,...,Rₘ. Inspecting the OuterClaus rules, it is clear that m ≤ 4, which makes enforcing this simplification tolerable. Furthermore, as f is simplified in each of R₁,...,Rₘ, they are smaller than any premise Cᵢ.

Another potential source of combinatorial explosion in our calculus are formulas that occur deep inside the arguments of uninterpreted predicates. Consider the clause C = p<sup>i</sup>(x) ≈ ⊤ ∨ q<sup>j</sup>(y) ≈ ⊤ where i, j > 2. If the first and the second literal are eligible in C, any clause p<sup>i₁</sup>(x) ≈ ⊤ ∨ p<sup>i₂</sup>(⊥) ≈ ⊤ ∨ ··· ∨ p<sup>iₖ</sup>(⊥) ≈ ⊤ ∨ q<sup>j₁</sup>(y) ≈ ⊤ ∨ q<sup>j₂</sup>(⊥) ≈ ⊤ ∨ ··· ∨ q<sup>jₗ</sup>(⊥) ≈ ⊤ (where i₁ + ··· + iₖ = i and j₁ + ··· + jₗ = j), resulting from multiple BoolHoist applications, can be obtained in many different ways. This explosion can be avoided using the following rule:

$$\frac{s \approx t \lor C}{\mathsf{p}(\bar{x}) \approx \top \lor C \quad R_1 \quad \cdots \quad R_4}\,\text{RenameDeep}$$

where p is a fresh symbol, x̄ are all free variables occurring in s ≈ t, the clauses R₁,...,R₄ result from simplifying R = p(x̄) ≈ (s ≈ t) as described above, and we impose the same precedence and weight restrictions on p as for Rename. Finally, we require that both s ≈ t and C contain deep Booleans, where a Boolean subterm u|<sub>p</sub> of a term u is a *deep Boolean* if there are at least two distinct proper prefixes q of the position p such that the root of u|<sub>q</sub> is an uninterpreted predicate.
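
The deep-Boolean test can be sketched directly from this definition. The term encoding and the LOGICAL set below are our own simplification, treating every non-logical root on the path as an uninterpreted predicate:

```python
# Positions are lists of argument indices; terms are nested tuples
# ("f", arg1, ...).  LOGICAL approximates the interpreted symbols.
LOGICAL = {"⊤", "⊥", "¬", "∧", "∨", "→", "≈", "≉", "∀", "∃"}

def subterm_at(u, pos):
    for i in pos:
        u = u[i + 1]                 # argument i sits at tuple index i + 1
    return u

def is_deep_boolean(u, pos):
    """True if at least two distinct proper prefixes q of pos have an
    uninterpreted predicate at the root of u|q."""
    n = 0
    for k in range(len(pos)):        # proper prefixes of pos, including ε
        root = subterm_at(u, pos[:k])[0]
        if root not in LOGICAL:
            n += 1
    return n >= 2
```

For example, the occurrence of ⊥ in p(p(q(⊥))) is deep (three uninterpreted roots above it), while its occurrence in p(⊥) is not.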

Similarly to Rename, the definition clause R can be larger than the premise. As the OuterClaus rules might not apply to s ≈ t, we need a different solution:

$$\frac{C[u]}{C[\bot] \lor u \approx \top \qquad C[\top] \lor u \approx \bot}\,\text{BoolHoistSimp}$$

In this rule, u is a non-variable Boolean subterm, different from ⊤ and ⊥, whose indicated occurrence is not in a literal u ≈ b where b is ⊤, ⊥, or a variable. Clearly, both conclusions of BoolHoistSimp are smaller than the premise. As before, observing that R is equivalent to the two clauses R<sup>+</sup> = p(x̄) ≈ ⊥ ∨ s ≈ t and R<sup>−</sup> = p(x̄) ≈ ⊤ ∨ s ≉ t, we simplify R<sup>+</sup> and R<sup>−</sup> into clauses that are guaranteed to be smaller than the premise. This is achieved by applying BoolHoistSimp to one of the deep Boolean occurrences in both R<sup>+</sup> and R<sup>−</sup>, which produces R₁,...,R₄ and reduces the size of the resulting clauses enough for them to be smaller than the premise of RenameDeep. The RenameDeep rule applies analogously to negative literals s ≉ t.

#### 6 Implementation

Zipperposition [11] is an automatic theorem prover designed for easy prototyping of various extensions of superposition. So far, it has been extended to support induction, arithmetic, and various fragments of higher-order logic. We have implemented our calculus and its extensions described above in Zipperposition.

Zipperposition has long supported λ as the only binder. Because introducing new binders would significantly complicate the implementation, we decided to represent the terms ∀x. t and ∃x. t as ∀(λx. t) and ∃(λx. t), respectively.
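This representation trick can be sketched with a minimal term datatype. The sketch below is illustrative Python (Zipperposition itself is written in OCaml, and its actual term structure differs): the only binder is λ, and the quantified terms ∀x. t and ∃x. t are built as applications of constants to a λ-abstraction.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Const:          # a constant symbol, e.g. 'forall', 'exists', 'p'
    name: str

@dataclass(frozen=True)
class Var:            # a term variable
    name: str

@dataclass(frozen=True)
class App:            # application  s t
    fun: object
    arg: object

@dataclass(frozen=True)
class Lam:            # the only binder:  λx. body
    var: str
    body: object

def forall(x, body):
    """Encode 'forall x. body' as the application  forall (λx. body)."""
    return App(Const("forall"), Lam(x, body))

def exists(x, body):
    """Encode 'exists x. body' as the application  exists (λx. body)."""
    return App(Const("exists"), Lam(x, body))

t = forall("x", App(Const("p"), Var("x")))
# t = App(Const('forall'), Lam('x', App(Const('p'), Var('x'))))
```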

We introduced a normalized representation of predicate literals as either s ≈ ⊤ or s ≈ ⊥. As Zipperposition previously encoded them as s ≈ ⊤ or s ≉ ⊤, enforcing the new encoding required tedious implementation effort.
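The normalization itself is simple; a sketch under the assumption that literals are triples (lhs, rhs, positive) and that ⊤/⊥ are distinguished constants:

```python
TOP, BOT = "⊤", "⊥"

def normalize_literal(lhs, rhs, positive):
    """Rewrite a predicate literal into the form s ≈ ⊤ or s ≈ ⊥:
    a disequation s ≉ ⊤ becomes s ≈ ⊥, and s ≉ ⊥ becomes s ≈ ⊤."""
    if rhs in (TOP, BOT) and not positive:
        return lhs, (BOT if rhs == TOP else TOP), True
    return lhs, rhs, positive

assert normalize_literal("p(a)", TOP, False) == ("p(a)", BOT, True)
assert normalize_literal("p(a)", BOT, False) == ("p(a)", TOP, True)
```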

Factoring inferences happen even when the maximal literal is selected, because condition (3) described in Sect. 3 was discovered only after the evaluation.

Zipperposition's existing selection functions were not designed with Boolean subterm selection in mind. For instance, a function that selects a literal L with a selectable Boolean subterm s can make s eligible, even if the Boolean selection function did not select s. To mitigate this issue, we can optionally block selection of literals that contain selectable Boolean subterms.

We implemented four Boolean selection functions: selecting the leftmost innermost, leftmost outermost, syntactically largest or syntactically smallest selectable subterm. Ties are broken by selecting the leftmost term. Additionally, we implemented a Boolean selection function that does not select any subterm.
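The four strategies can be sketched as follows on a precomputed list of selectable Boolean subterms, each given as a pair (position, term). This is an illustrative Python model, not Zipperposition code: positions are index tuples ordered left to right, terms are nested tuples, and ties are broken by taking the lexicographically smallest (leftmost) position.

```python
def term_size(t):
    """Syntactic size of a nested-tuple term; atoms count as 1."""
    return 1 + sum(term_size(u) for u in t[1:]) if isinstance(t, tuple) else 1

def select_bool(candidates, strategy):
    """Pick one selectable Boolean subterm according to the strategy,
    or None for the selection function that selects nothing."""
    if strategy == "none" or not candidates:
        return None
    keys = {
        # innermost = deepest position, then leftmost
        "leftmost_innermost": lambda c: (-len(c[0]), c[0]),
        # outermost = shallowest position, then leftmost
        "leftmost_outermost": lambda c: (len(c[0]), c[0]),
        "largest": lambda c: (-term_size(c[1]), c[0]),
        "smallest": lambda c: (term_size(c[1]), c[0]),
    }
    return min(candidates, key=keys[strategy])

cands = [((0,), ("p", ("q", "x"))), ((1, 0), ("q", "x"))]
assert select_bool(cands, "smallest") == ((1, 0), ("q", "x"))
assert select_bool(cands, "largest") == ((0,), ("p", ("q", "x")))
```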

Vukmirović and Nummelin [33, Sect. 3.4] explored inprocessing clausification as part of their pragmatic approach to higher-order Boolean reasoning. They describe in detail how the formula renaming mechanism is implemented. We reuse their mechanism, and simplify definition clauses as described in Sect. 5.

#### 7 Evaluation

The goal of our evaluation was to answer the following questions:


We filtered TPTP [29] and SMT-LIB [5] to obtain first-order benchmarks that actually use the Boolean type. In TPTP THF we found 145 such problems (*TPTP Bool*), and in the UF section of SMT-LIB, 5507 such problems. Martin Desharnais and Jasmin Blanchette generated 1253 Sledgehammer problems that target our logic. To measure the overhead of our calculus, we randomly chose 1000 FOF and CNF problems from the TPTP (*TPTP FO*). Even with this sample, the experiment could take up to (145 + 5507 + 1253 + 1000) × #modes × 300 s ≈ 9 CPU months. On the StarExec servers, the evaluation took roughly three days under low load. Without this sampling, evaluating on all 13 000 FOF and CNF problems could have taken 2.5 times longer.
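The quoted budget can be checked with back-of-the-envelope arithmetic. The mode count is an assumption here (10 modes, which reproduces the ≈ 9 CPU months; the exact number of evaluated modes is not stated in this paragraph):

```python
# Budget estimate for the evaluation described above.
problems = 145 + 5507 + 1253 + 1000   # TPTP Bool + SMT-LIB + Sledgehammer + TPTP FO
modes = 10                            # assumed number of evaluated modes
seconds = problems * modes * 300      # 300 s CPU limit per problem and mode
cpu_months = seconds / (30 * 24 * 3600)
print(round(cpu_months, 1))           # ≈ 9.1
```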

SMT-LIB interprets the symbol ite as the standard if-then-else function [5, Sect. 3.7.1]. Whenever a term s = ite(t₁, t₂, t₃) of type τ occurs in a problem, we replace s with f_τ(t₁, t₂, t₃), where f_τ is a fresh symbol denoting the ite function of that particular return type. To comply with SMT-LIB, we add the following axioms: ∀x y. f_τ(⊤, x, y) ≈ x and ∀x y. f_τ(⊥, x, y) ≈ y. SMT-LIB also allows let variable bindings [5, Sect. 3.6.1]; we simply replace each variable with its definition in the body of the let binding.
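This elimination can be sketched as a bottom-up rewrite that introduces one fresh symbol and one axiom pair per return type. The sketch is illustrative Python (nested-tuple terms, a user-supplied typing function), not the actual implementation:

```python
def eliminate_ite(term, typ_of, axioms):
    """Replace every occurrence ('ite', t1, t2, t3) by (f_tau, t1, t2, t3),
    where tau = typ_of(occurrence) is its return type, and collect the
    defining axioms  f_tau(⊤, x, y) ≈ x  and  f_tau(⊥, x, y) ≈ y
    once per type in the dict `axioms`."""
    if not isinstance(term, tuple):
        return term
    head, *args = term
    args = [eliminate_ite(a, typ_of, axioms) for a in args]
    if head == "ite":
        tau = typ_of(term)           # typing of the original occurrence
        f = f"f_{tau}"
        axioms.setdefault(tau, [f"{f}(⊤, x, y) ≈ x", f"{f}(⊥, x, y) ≈ y"])
        return (f, *args)
    return (head, *args)

ax = {}
t = eliminate_ite(("g", ("ite", "c", "a", "b")), lambda _t: "int", ax)
# t = ('g', ('f_int', 'c', 'a', 'b')), with two axioms recorded for 'int'
```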

Currently, among competing superposition-based provers only E and Vampire support first-order logic with interpreted Booleans, and they do so through preprocessing. We could not evaluate Vampire in the first-order mode with FOOL preprocessing because it yielded unsound results on TPTP Bool benchmarks. We were able to run E on all benchmarks, except for the ones in SMT syntax.

We used Zipperposition's first-order portfolio, which invokes the prover sequentially with up to 13 configurations in different time slices. To compare different features, we ran different *modes* that enable a given feature in all of the portfolio configurations. All experiments were performed on the StarExec Iowa servers [28], equipped with Intel Xeon E5-2609 0 CPUs clocked at 2.40 GHz. We set the CPU time limit to 300 s. Figure 1 displays the results. An empty cell indicates that a mode is not evaluated on that benchmark set. An archive with the raw evaluation data is publicly available.<sup>4</sup>

A preprocessing transformation that removes all Boolean subterms occurring as arguments of symbols [34, Sect. 8], similar to Kotelnikov et al.'s FOOL clausification approach [16], is implemented in Zipperposition. To answer question 1, we enabled preprocessing and compared it to our new calculus parameterized with the Boolean selection function that selects the smallest selectable subterm. The mode using our new calculus performs immediate inprocessing clausification, and we call it *base*, while the mode that preprocesses Boolean subterms is denoted by *preprocess* in Figure 1.

The obtained results do not give a conclusive answer to question 1. On both TPTP Bool and Sledgehammer problems, some configuration of our new calculus manages to prove one problem more than preprocessing. On SMT-LIB benchmarks, the best configuration of our calculus matches preprocessing. This shows that our calculus already performs roughly as well as previously known techniques and suggests that it will be able to outperform preprocessing techniques after tuning of its parameters.

For context, we provide the evaluation of E on the supported benchmarks. On TPTP FO benchmarks it solves 643 problems, on TPTP Bool benchmarks 144 problems, and on Sledgehammer benchmarks 674 problems. Note that there is no straightforward way to compare these results with Zipperposition.

<sup>4</sup> https://doi.org/10.5281/zenodo.4550787

Fig. 1: Number of problems solved per benchmark set and Zipperposition mode. The x-axes start from the number of problems solved by all evaluated modes.

Our *base* mode uses immediate inprocessing clausification. To answer question 2, we compared *base* with a variant of *base* with outer delayed clausification (*base+outer* ) and with a variant with inner delayed clausification (*base+inner* ). In the delayed modes, we invoke the Rename rule on formulas that are discovered to occur more than four times in the proof state.

The results show that inner delayed clausification, which performs the laziest form of clausification, gives the worst results on most benchmark sets. Outer delayed clausification performs roughly as well as immediate clausification on problems targeting our logic. On purely first-order problems, it performs slightly worse than immediate clausification. However, outer delayed clausification solves 17 problems not solved by immediate clausification on these problems. This suggests that it opens new possibilities for first-order reasoning that need to be explored further with specialized strategies and additional rules.

We found a problem with a conjecture of the form s → s that only the delayed clausification modes can prove: the TPTP problem SWV122+1. The subformula renaming mechanism of immediate clausification obfuscates this problem, whereas delayed clausification allows BoolSimp to convert the negated conjecture to ⊥ directly, completing the proof in half a second.

To answer question 3, we compared the mode of Zipperposition in which all rules introduced by our calculus are disabled (*off* ) with *base* on purely first-order problems. Our results show that both modes perform roughly the same.

To answer question 4, we evaluated the Boolean selection functions we have implemented: syntactically smallest selectable term (used in *base*), syntactically largest selectable term (selmax), leftmost innermost selectable term (selli), leftmost outermost selectable term (sello), and no Boolean selection (sel∅). We also evaluated two modes in which the rules LocalRw and BoolHoistSimp (BHS) are enabled. None of the selection functions influences the performance greatly. Similarly, we observe no substantial difference regardless of whether the rules LocalRw and BoolHoistSimp are enabled.

# 8 Related Work and Conclusion

The research presented in this paper extends superposition in two directions: with inprocessing clausification and with first-class Booleans. The first direction has been explored before by Ganzinger and Stuber [13], and others have investigated it in the context of other superposition-related calculi [1,4,9,20,21].

The other direction has been explored before by Kotelnikov et al., who developed two approaches to cope with first-class Booleans [16,17]. For the quantified Boolean formula fragment of our logic, Seidl et al. developed a translation into effectively propositional logic [26]. More general approaches to incorporate theories into superposition include superposition for finite domains [14], hierarchic superposition [6], and superposition with (co)datatypes [10].

For SMT solvers [22], supporting first-class Booleans is a widely accepted standard [5]. In contrast, the TPTP TFX format [30], intended to promote first-class Booleans in the rest of the automated reasoning community, has yet to gain traction. Software verification tools could clearly benefit from its popularization, as some of them identify terms and formulas in their logic, e.g., Why3 [12].

In conclusion, we devised a refutationally complete superposition calculus for first-order logic with interpreted Booleans. Its redundancy criterion allows us to flexibly add inprocessing clausification and other simplification rules. We believe our calculus is an excellent choice as the basis of new superposition provers: it offers the full power of standard superposition while supporting rich input languages such as SMT-LIB and TPTP TFX. Even with an unoptimized implementation and basic strategies, our calculus matches the performance of earlier approaches. In addition, the freedom it offers in term order, literal selection, and Boolean subterm selection opens possibilities that are yet to be explored. Overall, our calculus appears to be a solid foundation for richer logics in which the Boolean type cannot be efficiently preprocessed, such as higher-order logic [8]. In future work, we plan to tune the parameters, and it would be interesting to combine our calculus with clause splitting techniques such as AVATAR [32].

Acknowledgment Martin Desharnais and Jasmin Blanchette generated the Sledgehammer benchmarks. Simon Cruanes helped us with the implementation. The anonymous reviewers, Ahmed Bhayat, Jasmin Blanchette, and Uwe Waldmann suggested textual improvements. The maintainers of StarExec Iowa let us use their service. We thank them all. Nummelin's research has received funding from the Netherlands Organization for Scientific Research (NWO) under the Vidi program (project No. 016.Vidi.189.037, Lean Forward). Bentkamp and Vukmirović's research has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 713999, Matryoshka).

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Superposition for Full Higher-order Logic

Alexander Bentkamp<sup>1</sup>, Jasmin Blanchette<sup>1,2,3</sup>, Sophie Tourret<sup>2,3</sup>, and Petar Vukmirović<sup>1</sup>

<sup>1</sup> Vrije Universiteit Amsterdam, Amsterdam, the Netherlands

{a.bentkamp,j.c.blanchette,p.vukmirovic}@vu.nl

<sup>2</sup> Université de Lorraine, CNRS, Inria, LORIA, Nancy, France

sophie.tourret@inria.fr

<sup>3</sup> Max-Planck-Institut für Informatik, Saarland Informatics Campus, Saarbrücken, Germany

Abstract. We recently designed two calculi as stepping stones towards superposition for full higher-order logic: Boolean-free λ-superposition and superposition for first-order logic with interpreted Booleans. Stepping on these stones, we finally reach a sound and refutationally complete calculus for higher-order logic with polymorphism, extensionality, Hilbert choice, and Henkin semantics. In addition to the complexity of combining the calculus's two predecessors, new challenges arise from the interplay between λ-terms and Booleans. Our implementation in Zipperposition outperforms all other higher-order theorem provers and is on a par with an earlier, pragmatic prototype of Booleans in Zipperposition.

# 1 Introduction

Superposition is a leading calculus for first-order logic with equality. We have been wondering for some years whether it would be possible to gracefully generalize it to extensional higher-order logic and use it as the basis of a strong higher-order automatic theorem prover. Towards this goal, we have, together with colleagues, designed superposition-like calculi for three intermediate logics between first-order and higher-order logic. Now we are finally ready to assemble a superposition calculus for full higher-order logic. The filiation of our new calculus from Bachmair and Ganzinger's standard first-order superposition is as follows:


Our goal was to devise an efficient calculus for higher-order logic. To achieve it, we pursued two objectives. First, the calculus should be refutationally complete. Second, the calculus should coincide as much as possible with its predecessors oSup and λSup on the respective fragments of higher-order logic (which in turn essentially coincide with Sup on first-order logic). Achieving these objectives is the main contribution of this paper. We made an effort to keep the calculus simple, but often the refutational completeness proof forced our hand to add conditions or special cases.

Like oSup, our calculus oλSup operates on clauses that can contain Boolean subterms, and it interleaves clausification with other inferences. Like λSup, oλSup eagerly βη-normalizes terms, employs full higher-order unification, and relies on a fluid subterm superposition rule (FLUIDSUP) to simulate superposition inferences below applied variables, i.e., terms of the form y t₁ ... t_n for n ≥ 1.

Because oSup contains several superposition-like inference rules for Boolean subterms, our completeness proof requires dedicated *fluid Boolean subterm hoisting rules* (FLUIDBOOLHOIST, FLUIDLOOBHOIST), which simulate Boolean inferences below applied variables, in addition to FLUIDSUP, which simulates superposition inferences.

Due to restrictions related to the term order that parameterizes superposition, it is difficult to handle variables bound by unclausified quantifiers if these variables occur applied or in arguments of applied variables. We solve the issue by replacing such quantified terms ∀y. t by equivalent terms (λy. t) ≈ (λy. ⊤) in a preprocessing step.

We implemented our calculus in the Zipperposition prover and evaluated it on TPTP and Sledgehammer benchmarks. The new Zipperposition outperforms all other higher-order provers and is on a par with an ad hoc implementation of Booleans in the same prover by Vukmirović and Nummelin [30]. We refer to the technical report [8] for the completeness proof and a more detailed account of the calculus and its evaluation.

#### 2 Logic

Our logic is higher-order logic (simple type theory) with rank-1 polymorphism, Hilbert choice, and functional and Boolean extensionality. Its syntax mostly follows Gordon and Melham [17]. We use the notation ā_n or ā to stand for the tuple (a₁, ..., a_n), where n ≥ 0. Deviating from Gordon and Melham, type arguments are explicit, written as c⟨τ̄_m⟩ for a symbol c : Πᾱ_m. υ and types τ̄_m. In the type signature Σ_ty, we require the presence of a nullary Boolean type constructor o and a binary function type constructor →. In the term signature Σ, we require the presence of the logical symbols ⊤, ⊥, ¬, ∧, ∨, →, ∀, ∃, ≈, and ≉. The logical symbols are shown in bold to distinguish them from the notation used for clauses below. Moreover, we require the presence of the Hilbert choice operator ε ∈ Σ. Although ε is interpreted in our semantics, we do not consider it a logical symbol. Our calculus will enforce the semantics of ε by an axiom, whereas the semantics of the logical symbols will be enforced by inference rules. We write V for the set of (term) variables. We use Henkin semantics, in the style of Fitting [15], with respect to which we can prove our calculus refutationally complete. In summary, our logic essentially coincides with the TPTP TH1 format [20].

We generally view terms modulo αβη-equivalence. When defining operations that need to analyze the structure of terms, however, we use a custom normal form as the default representative of a βη-equivalence class: the βηQ_η*-normal* form t↓_{βηQη} of a term t is obtained by bringing the term into η-short β-normal form and then applying the rewrite rule Q⟨τ⟩ s −→_{Qη} Q⟨τ⟩ (λx. s x) exhaustively whenever s is not a λ-expression. Here and elsewhere, Q stands for either ∀ or ∃.

On top of the standard higher-order terms, we install a clausal structure that allows us to formulate calculus rules in the style of first-order superposition. A literal s ≈̇ t is an equation s ≈ t or a disequation s ≉ t of terms s and t; both equations and disequations are unordered pairs. A clause L₁ ∨ ··· ∨ L_n is a finite multiset of literals L_j. The empty clause is written as ⊥. This clausal structure does not restrict the logic, because an arbitrary term t of Boolean type can be written as the clause t ≈ ⊤.

We considered excluding negative literals by encoding them as (s ≈ t) ≈ ⊥, following ←→Sup [16]. However, this approach would make the conclusion of the equality factoring rule (EFACT) too large for our purposes. Regardless, the simplification machinery will allow us to reduce negative literals t ≉ ⊥ and t ≉ ⊤ to t ≈ ⊤ and t ≈ ⊥, respectively, thereby eliminating redundant representations of nonequational literals.

We let CSU(s, t) denote an arbitrary (preferably minimal) complete set of unifiers for two terms s and t on the set of free variables of the clauses in which s and t occur. To compute such sets, Huet-style preunification [18] is not sufficient, and we must resort to a full unification procedure [19, 29]. To cope with the nontermination of such procedures, we use dovetailing as described by Vukmirović et al. [28, Sect. 5].

Some of the rules in our calculus introduce Skolem symbols, representing objects mandated by existential quantification. We assume that these symbols do not occur in the input problem. More formally, given a problem over a term signature Σ, our calculus operates on a Skolem-extended term signature Σ_sk that, in addition to all symbols from Σ, inductively contains symbols sk_{Πᾱ. ∀x̄. ∃z. t z} : Πᾱ. τ̄ → υ for all types υ, variables z : υ, and terms t : υ → o over Σ_sk, where ᾱ are the free type variables occurring in t and x̄ : τ̄ are the free term variables occurring in t, both in order of first occurrence.
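Since each Skolem symbol is indexed by the formula it Skolemizes, a natural consequence is that repeated Skolemization of the same formula reuses the same symbol. A small illustrative sketch of this bookkeeping (the names and the string-keyed table are our own simplification, not the actual implementation):

```python
skolem_signature = {}

def skolem_symbol(formula_key, arg_types, result_type):
    """Return the Skolem symbol for the given (printed) formula,
    creating and recording it in the extended signature only once."""
    if formula_key not in skolem_signature:
        skolem_signature[formula_key] = (f"sk_{len(skolem_signature)}",
                                         tuple(arg_types), result_type)
    return skolem_signature[formula_key]

s1 = skolem_symbol("∀x. ∃z. p x z", ["ι"], "ι")
s2 = skolem_symbol("∀x. ∃z. p x z", ["ι"], "ι")
assert s1 is s2   # same formula, same Skolem symbol
```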

# 3 The Calculus

The oλSup calculus closely resembles λSup, augmented with rules for Boolean reasoning that are inspired by oSup. As in λSup, superposition-like inferences are restricted to certain first-order-like subterms, the *green subterms*, which we define inductively as follows: every term t is a green subterm of t, and for all symbols f ∈ Σ \ {∀, ∃}, if t is a green subterm of u_i for some i, then t is a green subterm of f⟨τ̄⟩ ū. For example, the green subterms of f (g (¬p)) (∀⟨τ⟩ (λx. q)) (y a) (λx. h b) are the term itself, g (¬p), ¬p, p, ∀⟨τ⟩ (λx. q), y a, and λx. h b. We write s⟨t⟩ to denote a term s with a green subterm t and call the first-order-like context s⟨ ⟩ a *green context*.
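The inductive definition can be sketched directly as an enumeration procedure. The encoding below is ours (illustrative Python, not prover code): terms are ("sym", name, [args]) for an applied symbol, ("var", name, [args]) for an applied variable, and ("lam", x, body) for a λ-expression; the quantifiers are the symbols "forall" and "exists".

```python
def green_subterms(t):
    """Yield the green subterms of t: t itself and, recursively, the green
    subterms of arguments of symbols other than the quantifiers. Arguments
    of quantifiers and of applied variables, and λ-bodies, are not green."""
    yield t
    if t[0] == "sym" and t[1] not in ("forall", "exists"):
        for u in t[2]:
            yield from green_subterms(u)

# The example from the text:  f (g (¬p)) (∀⟨τ⟩ (λx. q)) (y a) (λx. h b)
p = ("sym", "p", [])
t = ("sym", "f", [
    ("sym", "g", [("sym", "¬", [p])]),
    ("sym", "forall", [("lam", "x", ("sym", "q", []))]),
    ("var", "y", [("sym", "a", [])]),
    ("lam", "x", ("sym", "h", [("sym", "b", [])])),
])
# green subterms: the term itself, g (¬p), ¬p, p, ∀⟨τ⟩ (λx. q), y a, λx. h b
assert len(list(green_subterms(t))) == 7
```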

Following λSup, we call a term t *fluid* if (1) t↓_{βηQη} is of the form y ū_n where n ≥ 1, or (2) t↓_{βηQη} is a λ-expression and there exists a substitution σ such that tσ↓_{βηQη} is not a λ-expression (due to η-reduction). Intuitively, fluid terms are terms whose normal form can change radically as a result of instantiation.

We define deeply occurring variables as in λSup, but exclude λ-expressions directly below quantifiers: A variable *occurs deeply* in a clause *C* if it occurs inside an argument of an applied variable or inside a λ-expression that is not directly below a quantifier.

Preprocessing. Our completeness theorem requires that quantified variables do not appear in certain higher-order contexts. We use preprocessing to eliminate problematic occurrences of quantifiers. The rewrite rules ∀≈ and ∃≈, which we collectively denote by Q≈, are defined as ∀⟨τ⟩ −→_{∀≈} λy. y ≈ (λx. ⊤) and ∃⟨τ⟩ −→_{∃≈} λy. y ≉ (λx. ⊥), where the rewritten occurrence of Q⟨τ⟩ is unapplied or has an argument of the form λx. v such that x occurs as a nongreen subterm of v. If either of these rewrite rules can be applied to a given term, the term is Q≈*-reducible*; otherwise, it is Q≈*-normal*.

For example, the term λy. ∃⟨ι → ι⟩ (λx. g x y (z y) (f x)) is Q≈-normal. A term may be Q≈-reducible because a quantifier appears unapplied (e.g., g ∃⟨ι⟩); a quantified variable occurs applied (e.g., ∃⟨ι → ι⟩ (λx. x a)); a quantified variable occurs inside a nested λ-expression (e.g., ∀⟨ι⟩ (λx. f (λy. x))); or a quantified variable occurs in the argument of a variable, either a free variable (e.g., ∀⟨ι⟩ (λx. z x)) or a variable bound above the quantifier (e.g., λy. ∃⟨ι⟩ (λx. y x)).

A preprocessor Q≈-normalizes the input problem. Although inferences may produce Q≈-reducible clauses, we do not Q≈-normalize during the derivation process itself. Instead, Q≈-reducible ground instances of clauses will be considered redundant by the redundancy criterion. Thus, clauses whose ground instances are all Q≈-reducible can be deleted. However, there are Q≈-reducible clauses, such as x ∀⟨ι⟩ ≈ a, that nevertheless have Q≈-normal ground instances. Such clauses must be kept because the completeness proof relies on their Q≈-normal ground instances.

In principle, we could omit the side condition of the Q≈-rewrite rules and eliminate all quantifiers. However, the calculus (especially, the redundancy criterion) performs better with quantifiers than with λ-expressions, which is why we restrict Q≈-normalization as much as the completeness proof allows. Extending the preprocessing to eliminate all Boolean terms as in Kotelnikov et al. [21] does not work for higher-order logic because Boolean terms can contain variables bound by enclosing λ-expressions.

Term Order. The calculus is parameterized by a well-founded strict total order ≻ on ground terms satisfying these four criteria: (O1) compatibility with green contexts, i.e., s′ ≻ s implies t⟨s′⟩ ≻ t⟨s⟩; (O2) green subterm property, i.e., t⟨s⟩ ⪰ s, where ⪰ is the reflexive closure of ≻; (O3) u ≻ ⊥ ≻ ⊤ for all terms u ∉ {⊤, ⊥}; (O4) Q⟨τ⟩ t ≻ t u for all types τ, terms t, and terms u such that Q⟨τ⟩ t and u are Q≈-normal and the only Boolean green subterms of u are ⊤ and ⊥. The restriction of (O4) to Q≈-normal terms ensures that term orders fulfilling the requirements exist, but it forces us to preprocess the input problem. We extend ≻ to literals and clauses via the multiset extensions in the standard way [2, Sect. 2.4].

For nonground terms, ≻ is required to be a strict partial order such that t ≻ s implies tθ ≻ sθ for all grounding substitutions θ. As in λSup, we also introduce a nonstrict variant ⪰ for which we require that tθ ⪰ sθ for all grounding substitutions θ whenever t ⪰ s, and similarly for literals and clauses.

To construct a concrete order fulfilling these requirements, we define an encoding into untyped first-order terms and compare these using a variant of the Knuth–Bendix order. In a first step, denoted O, the encoding translates fluid terms t as fresh variables z_t; nonfluid λ-expressions λx:τ. u as lam(O(τ), O(u)); applied quantifiers Q⟨τ⟩ (λx:τ. u) as Q₁(O(τ), O(u)); and other terms f⟨τ̄⟩ ū_k as f_k(O(τ̄), O(ū_k)). Bound variables are encoded as constants db_i corresponding to De Bruijn indices. In a second step, denoted P, the encoding replaces Q₁ by Q′₁ and variables z by z′ whenever they occur below lam. For example, ∀⟨ι⟩ (λx. p y y (λu. f y y (∀⟨ι⟩ (λv. u)))) is encoded as ∀₁(ι, p₃(y, y, lam(o, f₃(y′, y′, ∀′₁(ι, db₁))))). The first-order terms can then be compared using a transfinite Knuth–Bendix order ≻_kb [22]. Let the weight of ∀₁ and ∃₁ be ω, the weight of ⊤₀ and ⊥₀ be 1, and the weights of all other symbols be less than ω. Let the precedence > be total, with ⊥₀ and ⊤₀ as the symbols of lowest precedence and ⊥₀ > ⊤₀. Then let t ≻ s if O(P(t)) ≻_kb O(P(s)) and t ⪰ s if O(P(t)) ⪰_kb O(P(s)).

Selection Functions. The calculus is also parameterized by a literal selection function and a Boolean subterm selection function. We define an element x of a multiset M to be ⪰*-maximal* for some relation ⪰ if for all y ∈ M with y ⪰ x, we have y = x. It is *strictly* ⪰-maximal if it is ⪰-maximal and occurs only once in M.
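These two definitions translate directly into code. A minimal sketch, with the nonstrict order ⪰ passed in as a predicate gte(y, x) meaning y ⪰ x (the function names are ours):

```python
def is_maximal(multiset, x, gte):
    """x is ⪰-maximal in the multiset if every y with y ⪰ x equals x."""
    return all(y == x for y in multiset if gte(y, x))

def is_strictly_maximal(multiset, x, gte):
    """⪰-maximal and occurring only once in the multiset."""
    return is_maximal(multiset, x, gte) and multiset.count(x) == 1

m = [1, 2, 2]
assert is_maximal(m, 2, lambda y, x: y >= x)               # nothing above 2
assert not is_strictly_maximal(m, 2, lambda y, x: y >= x)  # 2 occurs twice
assert not is_maximal(m, 1, lambda y, x: y >= x)           # 2 ⪰ 1 but 2 ≠ 1
```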

The literal selection function *HLitSel* maps each clause to a subset of *selected literals*. A literal may not be selected if it is positive and neither side is ⊥. Moreover, a literal L⟨y⟩ may not be selected if y ū_n, with n ≥ 1, is a ⪰-maximal term of the clause.

The Boolean subterm selection function *HBoolSel* maps each clause C to a subset of *selected subterms* in C. Selected subterms must be green subterms of Boolean type. Moreover, a subterm s must not be selected if s = ⊤, if s = ⊥, if s is a variable-headed term, if s is at the topmost position on either side of a positive literal, or if s contains a variable y as a green subterm such that y ū_n, with n ≥ 1, is a ⪰-maximal term of the clause.

Eligibility. A literal L is (*strictly*) *eligible* w.r.t. a substitution σ in C if it is selected in C, or if there are no selected literals and no selected Boolean subterms in C and Lσ is (strictly) ⪰-maximal in Cσ.

The eligible subterms of a clause C w.r.t. a substitution σ are inductively defined as follows: any selected subterm is eligible. If a literal L = s ≈̇ t with sσ ⋠ tσ is either eligible and negative or strictly eligible and positive, then the subterm s is eligible. If a subterm t is eligible and the head of t is not ≈ or ≉, all direct green subterms of t are eligible. If a subterm t is eligible and t is of the form u ≈ v or u ≉ v, then u is eligible if uσ ⋠ vσ and v is eligible if uσ ⋡ vσ.

The Core Inference Rules. The calculus consists of the following core inference rules. The first five rules stem from λSup, with minor adaptions concerning Booleans:

$$\frac{D' \lor t \approx t' \quad C\langle u \rangle}{(D' \lor C\langle t' \rangle)\sigma}\ \textsc{Sup} \qquad \frac{D' \lor t \approx t' \quad C\langle u \rangle}{(D' \lor C\langle z\, t' \rangle)\sigma}\ \textsc{FluidSup}$$

$$\frac{C' \lor u \not\approx u'}{C'\sigma}\ \textsc{ERes} \qquad \frac{C' \lor u' \approx v' \lor u \approx v}{(C' \lor v \not\approx v' \lor u \approx v')\sigma}\ \textsc{EFact} \qquad \frac{C' \lor s \approx s'}{C'\sigma \lor s\sigma\,\bar{x}_n \approx s'\sigma\,\bar{x}_n}\ \textsc{ArgCong}$$

SUP 1. u is not fluid; 2. u is not a variable deeply occurring in C; 3. if u is a variable y, there must exist a grounding substitution θ such that tσθ ≻ t′σθ and Cσθ ≺ C′σθ, where C′ = C{y ↦ t′}; 4. σ ∈ CSU(t, u); 5. tσ ⋠ t′σ; 6. u is eligible in C w.r.t. σ; 7. Cσ ⋠ Dσ; 8. t ≈ t′ is strictly eligible in D w.r.t. σ; 9. tσ is not a fully applied logical symbol; 10. if t′σ = ⊥, the subterm u is at the top level of a positive literal. ERES 1. σ ∈ CSU(u, u′); 2. u ≉ u′ is eligible in C w.r.t. σ.


The following rules are concerned with Boolean reasoning and originate from oSup. They have been adapted to support polymorphism and applied variables.


FALSEELIM 1. σ ∈ CSU(s ≈ s′, ⊥ ≈ ⊤); 2. s ≈ s′ is strictly eligible in C w.r.t. σ.


2. u is not a variable; 3. u is eligible in C w.r.t. σ; 4. if the head of u is a variable, it must be applied and the affected literal must be of the form u ≈ ⊤, u ≈ ⊥, or u ≈ v where v is a variable-headed term; 5. for FORALLRW, the indicated occurrence of u is not in a literal u ≈ ⊤, and for EXISTSRW, the indicated occurrence of u is not in a literal u ≈ ⊥.

Like SUP, also the Boolean rules must be simulated in fluid terms. The following rules are Boolean counterparts of FLUIDSUP:

$$\frac{C\langle u \rangle}{(C\langle z\,\bot\rangle \lor x \approx \top)\sigma}\ \textsc{FluidBoolHoist} \qquad \frac{C\langle u \rangle}{(C\langle z\,\top\rangle \lor x \approx \bot)\sigma}\ \textsc{FluidLoobHoist}$$

FLUIDBOOLHOIST 1. u is fluid; 2. z and x are fresh variables; 3. σ ∈ CSU(z x, u); 4. (z ⊥)σ ≠ (z x)σ; 5. xσ ≠ ⊤ and xσ ≠ ⊥; 6. u is eligible in C w.r.t. σ.

FLUIDLOOBHOIST Like the above, but with ⊥ replaced by ⊤ in condition 4.

In addition to the inference rules, our calculus relies on the two axioms below. Axiom (EXT), from λSup, embodies functional extensionality; the expression diff⟨α, β⟩ abbreviates sk_{Παβ. ∀z y. ∃x. z x ≉ y x}⟨α, β⟩. Axiom (CHOICE) characterizes the Hilbert choice operator ε.

$$z\,(\mathsf{diff}\langle\alpha,\beta\rangle\,z\,y) \not\approx y\,(\mathsf{diff}\langle\alpha,\beta\rangle\,z\,y) \lor z \approx y \tag{Ext}$$

$$y\,x \approx \bot \lor y\,(\varepsilon\langle\alpha\rangle\,y) \approx \top \tag{Choice}$$

Rationale for the Rules. Most of the calculus's rules are adapted from its precursors. SUP, ERES, and EFACT are already present in Sup, with slightly different side conditions. Notably, as in λfSup and λSup, SUP inferences are required only into green contexts. Other subterms are accessed indirectly via ARGCONG and (EXT).

The rules BOOLHOIST, EQHOIST, NEQHOIST, FORALLHOIST, EXISTSHOIST, FALSEELIM, BOOLRW, FORALLRW, and EXISTSRW, concerned with Boolean reasoning, stem from oSup, which was inspired by ←→Sup. Except for BOOLHOIST and FALSEELIM, these rules have a condition stating that "if the head of *u* is a variable, it must be applied and the affected literal must be of the form *u* ≈ ⊤, *u* ≈ ⊥, or *u* ≈ *v* where *v* is a variable-headed term." The inferences at variable-headed terms permitted by this condition are our form of primitive substitution [1, 18], a mechanism that blindly substitutes logical connectives and quantifiers for variables with a Boolean result type.

Example 1. Our calculus can prove that Leibniz equality implies equality (i.e., if two values behave the same for all predicates, they are equal) as follows:

$$\begin{array}{r@{\quad}l}
& z\,\mathsf{a} \approx \bot \lor z\,\mathsf{b} \approx \top \\
\textsc{EqHoist} & (x\,\mathsf{a} \approx y\,\mathsf{a}) \approx \bot \lor \bot \approx \top \lor x\,\mathsf{b} \approx y\,\mathsf{b} \\
\textsc{BoolRw} & \top \approx \bot \lor \bot \approx \top \lor w\,\mathsf{a}\,\mathsf{b}\,\mathsf{b} \approx w\,\mathsf{b}\,\mathsf{a}\,\mathsf{b} \\
\textsc{FalseElim} \times 2 & w\,\mathsf{a}\,\mathsf{b}\,\mathsf{b} \approx w\,\mathsf{b}\,\mathsf{a}\,\mathsf{b} \\
\textsc{Sup (with } \mathsf{a} \not\approx \mathsf{b}\textsc{)} & \mathsf{a} \not\approx \mathsf{a} \\
\textsc{ERes} & \bot
\end{array}$$

The EQHOIST inference, applied on *z* b, illustrates how our calculus introduces logical symbols without a dedicated primitive substitution rule. Although ≈ does not appear in the premise, we still need to apply EQHOIST on *z* b with CSU(*z* b, *x*<sup>0</sup> ≈ *y*<sup>0</sup>) = {{*z* → λ*v*. *x v* ≈ *y v*, *x*<sup>0</sup> → *x* b, *y*<sup>0</sup> → *y* b}}. Other calculi [1, 9, 18, 26] would apply an explicit primitive substitution rule instead, yielding essentially (*x* a ≈ *y* a) ≈ ⊥ ∨ (*x* b ≈ *y* b) ≈ ⊤. However, in our approach this clause is subsumed and could be discarded immediately. By hoisting the equality to the clausal level, we bypass the redundancy criterion.

Next, BOOLRW can be applied to *x* a ≈ *y* a with CSU(*x* a ≈ *y* a, *y*<sup>0</sup> ≈ *y*<sup>0</sup>) = {{*x* → λ*v*. *w* a *v v*, *y* → λ*v*. *w v* a *v*, *y*<sup>0</sup> → *w* a a a}}. The two FALSEELIM steps remove the ⊥ ≈ ⊤ literals. Then SUP is applicable with the unifier {*w* → λ*x*<sup>1</sup> *x*<sup>2</sup> *x*<sup>3</sup>. *x*<sup>2</sup>} ∈ CSU(b, *w* a b b), and ERES derives the contradiction.
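For comparison, the statement proved in Example 1 can be checked directly in a proof assistant. The following minimal Lean 4 rendering (ours, unrelated to the mechanics of the calculus) instantiates the Leibniz predicate with the same kind of equality predicate that EQHOIST's unifier introduces:

```lean
-- Leibniz equality implies equality: if a and b satisfy the same
-- predicates, they are equal. Instantiating p with (fun x => a = x)
-- mirrors the substitution found by EqHoist in Example 1.
theorem leibniz_eq {α : Type} (a b : α)
    (h : ∀ p : α → Prop, p a → p b) : a = b :=
  h (fun x => a = x) rfl
```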

As in λSup, the FLUIDSUP rule is responsible for simulating superposition inferences below applied variables, other fluid terms, and deeply occurring variables. Complementarily, FLUIDBOOLHOIST and FLUIDLOOBHOIST simulate the various Boolean inference rules below fluid terms. Initially, we considered adding a fluid version of each rule that operates on Boolean subterms, but we discovered that FLUIDBOOLHOIST and FLUIDLOOBHOIST suffice to achieve refutational completeness.

Example 2. The clause set consisting of h (*y* b) ≉ h (g ⊥) ∨ h (*y* a) ≉ h (g ⊤) and a ≉ b highlights the need for FLUIDBOOLHOIST and its companion. The set is unsatisfiable because the instantiation {*y* → λ*x*. g (*x* ≈ a)} produces the clause h (g (b ≈ a)) ≉ h (g ⊥) ∨ h (g (a ≈ a)) ≉ h (g ⊤), which is unsatisfiable in conjunction with a ≉ b.
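As a quick sanity check of Example 2's instantiation, the following toy model (entirely our own: Python ints for the base type, tagging functions for g and h, nothing from the calculus itself) confirms that under *y* := λ*x*. g (*x* ≈ a), the term h (*y* b) coincides with h (g ⊥) and h (*y* a) coincides with h (g ⊤), which is exactly what makes the clause set unsatisfiable:

```python
# Toy interpretation: base type = int with a != b, g and h = tagging
# functions, Booleans = Python bools.
a, b = 0, 1
g = lambda p: ('g', p)
h = lambda t: ('h', t)

# The instantiation from Example 2: y := λx. g (x ≈ a).
y = lambda x: g(x == a)

# Both sides of each literal of the first clause evaluate to the same value.
assert h(y(b)) == h(g(False))   # h (y b) and h (g ⊥) coincide
assert h(y(a)) == h(g(True))    # h (y a) and h (g ⊤) coincide
```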

The literal selection function can select either literal in the first clause. ERES is applicable in either case, but the unifiers {*y* → λ*x*. g ⊥} and {*y* → λ*x*. g ⊤} do not lead to a contradiction. Instead, we need to apply FLUIDBOOLHOIST if the first literal is selected or FLUIDLOOBHOIST if the second literal is selected. In the first case, the derivation is as follows:

$$\begin{array}{r@{\quad}l}
& \mathsf{h}\,(y\,\mathsf{b}) \not\approx \mathsf{h}\,(\mathsf{g}\,\bot) \lor \mathsf{h}\,(y\,\mathsf{a}) \not\approx \mathsf{h}\,(\mathsf{g}\,\top) \\
\textsc{FluidBoolHoist} & \mathsf{h}\,(z'\,\mathsf{b}\,\bot) \not\approx \mathsf{h}\,(\mathsf{g}\,\bot) \lor \mathsf{h}\,(z'\,\mathsf{a}\,(x'\,\mathsf{a})) \not\approx \mathsf{h}\,(\mathsf{g}\,\top) \lor x'\,\mathsf{b} \approx \top \\
\textsc{ERes} & \mathsf{h}\,(\mathsf{g}\,(x'\,\mathsf{a})) \not\approx \mathsf{h}\,(\mathsf{g}\,\top) \lor x'\,\mathsf{b} \approx \top \\
\textsc{EqHoist} & \mathsf{h}\,(\mathsf{g}\,(x''\,\mathsf{a} \approx y''\,\mathsf{a})) \not\approx \mathsf{h}\,(\mathsf{g}\,\top) \lor \bot \approx \top \lor x''\,\mathsf{b} \approx y''\,\mathsf{b} \\
\textsc{Sup (with } \mathsf{a} \not\approx \mathsf{b}\textsc{)} & \mathsf{h}\,(\mathsf{g}\,(\mathsf{a} \approx y''\,\mathsf{a})) \not\approx \mathsf{h}\,(\mathsf{g}\,\top) \lor \bot \approx \top \lor \mathsf{a} \not\approx y''\,\mathsf{b} \\
\textsc{BoolRw} & \mathsf{h}\,(\mathsf{g}\,\top) \not\approx \mathsf{h}\,(\mathsf{g}\,\top) \lor \bot \approx \top \lor \mathsf{a} \not\approx \mathsf{a} \\
\textsc{ERes} \times 2,\ \textsc{FalseElim} & \bot
\end{array}$$

The FLUIDBOOLHOIST inference uses the unifier {*y* → λ*u*. *z*′ *u* (*x*′ *u*), *z* → λ*u*. *z*′ b *u*, *x* → *x*′ b} ∈ CSU(*z x*, *y* b). We apply ERES to the first literal of the resulting clause, with unifier {*z*′ → λ*u v*. g *v*} ∈ CSU(h (*z*′ b ⊥), h (g ⊥)). Next, we apply EQHOIST with the unifier {*x*′ → λ*u*. *x*″ *u* ≈ *y*″ *u*, *w* → *x*″ b, *w*′ → *y*″ b} ∈ CSU(*x*′ b, *w* ≈ *w*′) to the literal created by FLUIDBOOLHOIST, effectively performing a primitive substitution. The resulting clause can superpose into a ≉ b with the unifier {*x*″ → λ*u*. *u*} ∈ CSU(*x*″ b, b). The two sides of the interpreted equality in the first literal can then be unified, allowing us to apply BOOLRW with the unifier {*y*<sup>0</sup> → a, *y*″ → λ*u*. a} ∈ CSU(*y*<sup>0</sup> ≈ *y*<sup>0</sup>, a ≈ *y*″ a). Finally, applying ERES twice and FALSEELIM once yields the empty clause.

Remarkably, none of the provers that participated in the CASC-J10 competition can solve this two-clause problem within a minute. Satallax finds a proof after 72 s and LEO-II after over 7 minutes. Our new Zipperposition implementation solves it in 3 s.

The Redundancy Criterion. In first-order superposition, a clause is considered redundant if all its ground instances are entailed by ≺-smaller ground instances of other clauses. In essence, this will also be our definition, but we will use a different notion of ground instances and a different notion of entailment.

Given a clause *C*, let its *ground instances* *G*(*C*) be the set of all clauses of the form *C*θ for some substitution θ such that *C*θ is ground and Q≈-normal, and for all variables *x* occurring in *C*, the only Boolean green subterms of *x*θ are ⊤ and ⊥. The rationale of this definition is to ensure that ground instances of the conclusion of FORALLHOIST, EXISTSHOIST, FORALLRW, and EXISTSRW inferences are smaller than the corresponding instances of their premise by property (O4).

The redundancy criterion's notion of entailment is defined via an encoding into a weaker logic, following λfSup and λSup. In this paper, the weaker logic is ground first-order logic with interpreted Booleans—the ground fragment of the logic of oSup. Its signature (Σ<sub>ty</sub>, Σ<sub>GF</sub>) is derived from our higher-order signature (Σ<sub>ty</sub>, Σ) as follows. The type constructors Σ<sub>ty</sub> are the same in both signatures, but → is an uninterpreted type constructor in first-order logic. For each ground instance f⟨ῡ⟩ : τ<sub>1</sub> → ··· → τ<sub>*n*</sub> → τ of a symbol f ∈ Σ, we introduce a first-order symbol f<sub>ῡ</sub><sup>*j*</sup> ∈ Σ<sub>GF</sub> with argument types τ̄<sub>*j*</sub> and result type τ<sub>*j*+1</sub> → ··· → τ<sub>*n*</sub> → τ, for each *j*. Moreover, for each ground term λ*x*. *t*, we introduce a symbol lam<sub>λ*x*. *t*</sub> ∈ Σ<sub>GF</sub> of the same type. The symbols ⊥<sup>0</sup>, ⊤<sup>0</sup>, ¬<sup>1</sup>, ∧<sup>2</sup>, ∨<sup>2</sup>, →<sup>2</sup>, ≈<sub>τ</sub><sup>2</sup>, and ≉<sub>τ</sub><sup>2</sup> are identified with the corresponding first-order logical symbols.

We define an encoding *F* of Q≈-normal ground higher-order terms into this ground first-order logic recursively as follows: *F*(∀<sub>τ</sub> (λ*x*. *t*)) = ∀*x*. *F*(*t*) and *F*(∃<sub>τ</sub> (λ*x*. *t*)) = ∃*x*. *F*(*t*) for applied quantifiers; *F*(λ*x*. *t*) = lam<sub>λ*x*. *t*</sub> for λ-expressions; and *F*(f⟨ῡ⟩ *s̄*<sub>*j*</sub>) = f<sub>ῡ</sub><sup>*j*</sup>(*F*(*s̄*<sub>*j*</sub>)) for other terms. For quantified variables, we define *F*(*x*) = *x*. Here, Q≈-normality is crucial to ensure that bound variables do not occur applied or within λ-expressions. The definition of green subterms is devised such that green subterms correspond to first-order subterms via the encoding *F*, with the exception of first-order subterms below quantifiers. The encoding *F* is extended to clauses by mapping each literal and each side of a literal individually. From the entailment relation |= for the ground first-order logic, we derive an entailment relation |=*<sup>F</sup>* on Q≈-normal ground higher-order clauses by defining *M* |=*<sup>F</sup>* *N* if *F*(*M*) |= *F*(*N*). This relation is weaker than standard higher-order entailment; for example, {f ≈ g} ⊭*<sup>F</sup>* {f a ≈ g a} (because of the subscripts added by *F*) and {p (λ*x*. ⊤)} ⊭*<sup>F</sup>* {p (λ*x*. ¬⊥)} (because of the lam symbols used by *F*).
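To make the shape of the encoding concrete, here is a toy rendering of *F* in Python. The term datatype and the naming scheme (`f^j` for the subscripted first-order symbols, `lam<...>` for the λ-expression constants) are our own simplifications; in particular, we omit the type arguments that the paper's symbols carry:

```python
# Terms: ('var', x) | ('lam', x, body) | ('app', f, [args])
#      | ('forall', lam) | ('exists', lam)

def F(t):
    kind = t[0]
    if kind == 'var':                    # quantified variable: F(x) = x
        return t
    if kind in ('forall', 'exists'):     # applied quantifier, e.g. ∀τ(λx. t)
        _, (_, x, body) = t
        return (kind, x, F(body))        # becomes a first-order binder
    if kind == 'lam':                    # λ-expression becomes a fresh constant
        return ('const', 'lam<%s>' % (t,))
    _, f, args = t                       # f s̄_j  ↦  f^j(F(s̄_j))
    return ('app', '%s^%d' % (f, len(args)), [F(a) for a in args])

# ∀x. p x is encoded as the first-order formula ∀x. p^1(x):
print(F(('forall', ('lam', 'x', ('app', 'p', [('var', 'x')])))))
# → ('forall', 'x', ('app', 'p^1', [('var', 'x')]))
```

Note how the same head `p` applied to different numbers of arguments would receive different first-order names (`p^0`, `p^1`, ...), which is what breaks the entailment {f ≈ g} ⊭ {f a ≈ g a} in this weaker logic.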

Using |=*<sup>F</sup>*, we define a clause *C* to be *redundant* w.r.t. a clause set *N* if for every *D* ∈ *G*(*C*), we have {*E* ∈ *G*(*N*) | *E* ≺ *D*} |=*<sup>F</sup>* *D* or there exists a clause *C*′ ∈ *N* such that *C* ⊐ *C*′ and *D* ∈ *G*(*C*′). The tiebreaker ⊐ can be an arbitrary well-founded partial order on clauses; in practice, we use a well-founded restriction of the ill-founded strict subsumption relation [6, Sect. 3.4]. We denote the set of redundant clauses w.r.t. a clause set *N* by *Red*<sub>C</sub>(*N*). Note that |=*<sup>F</sup>* is weak enough to ensure that the ARGCONG inference rule and axiom (EXT) are not immediately redundant and can fulfill their purpose.

For first-order superposition, an inference is considered redundant if for each of its ground instances, a premise is redundant or the conclusion is entailed by clauses smaller than the main premise. For most inference rules, our definition follows this idea, using |=*<sup>F</sup>* for entailment; other rules need nonstandard notions of ground instances and redundancy. The definition of inference redundancy presented below is simpler than the more sophisticated notion in our technical report. Nonetheless, the redundant inferences below are a strict subset of the redundant inferences of our report and thus completeness also holds using the notion below. For the few prover optimizations based on inference redundancy that we know about (e.g., simultaneous superposition [4]), the following criterion suffices.

For SUP, ERES, EFACT, BOOLHOIST, FALSEELIM, EQHOIST, NEQHOIST, and BOOLRW, we define ground instances as usual: *Ground instances* are all inferences obtained by applying a grounding substitution to premises and conclusion such that the result adheres to the conditions of the given rule w.r.t. selection functions that select literals and subterms as in the original premise. For FLUIDSUP and FLUIDBOOLHOIST, we define ground instances in the same way, except that we require that ground instances adhere to the conditions of SUP or BOOLHOIST, respectively. For FORALLRW, EXISTSRW, FORALLHOIST, and EXISTSHOIST, which do not have ground instances in the sense above, we define a *ground instance* as any inference that is obtained by applying the unifier σ to the premise and then applying a grounding substitution to premise and conclusion, regardless of whether the resulting inference is an inference of our calculus.

For all rules except FLUIDLOOBHOIST and ARGCONG, we define an inference to be *redundant* w.r.t. a clause set *N* if for each ground instance ι, a premise of ι is redundant w.r.t. *G*(*N*) or the conclusion of ι is entailed w.r.t. |=*<sup>F</sup>* by clauses from *G*(*N*) that are smaller than the main (i.e., rightmost) premise of ι. For the rules FLUIDLOOBHOIST and ARGCONG, as well as axioms (EXT) and (CHOICE)—viewed as premiseless inferences—we define an inference to be *redundant* w.r.t. a clause set *N* if all ground instances of its conclusion are contained in *G*(*N*) or redundant w.r.t. *G*(*N*). We denote the set of redundant inferences w.r.t. *N* by *Red*<sub>I</sub>(*N*).

Simplification Rules. Our redundancy criterion is strong enough to support counterparts of most simplification rules implemented in Schulz's first-order E [25, Sect. 2.3.1 and 2.3.2]. Deletion of duplicated literals, deletion of resolved literals, syntactic tautology deletion, negative simplify-reflect, and clause subsumption adhere to our redundancy criterion. Positive simplify-reflect, equality subsumption, and rewriting (demodulation) of positive and negative literals are supported if they are applied on green subterms or on other subterms that are encoded into first-order subterms by *G* and *F*; moreover, for rewriting of positive literals, the rewriting clause must be smaller than the rewritten clause. Semantic tautology deletion can be applied as well, using |=*<sup>F</sup>*.

Under some circumstances, inference rules can be applied as simplifications. The FALSEELIM and BOOLRW rules can be applied as a simplification if σ is the identity. If the head of *u* is ∀, FORALLHOIST and FORALLRW can both be applied and, together, serve as one simplification rule. The same holds for EXISTSHOIST and EXISTSRW if the head of *u* is ∃. For all of these rules, the eligibility conditions can be ignored.

Clausification. Like oSup, our calculus does not require the input problem to be clausified during preprocessing, and it supports higher-order analogues of the three inprocessing clausification methods introduced by Nummelin et al. *Inner delayed clausification* relies on our core calculus rules to destruct logical symbols. *Outer delayed clausification* adds the following clausification rules to the calculus:

$$\frac{s \approx \top \lor C}{oc(s,\, C)}\;\textsc{PosOuterClaus} \qquad \frac{s \approx \bot \lor C}{oc(\neg s,\, C)}\;\textsc{NegOuterClaus}$$

$$\frac{s \approx t \lor C}{s \approx \bot \lor t \approx \top \lor C \qquad s \approx \top \lor t \approx \bot \lor C}\;\textsc{EqOuterClaus}$$

$$\frac{s \not\approx t \lor C}{s \approx \bot \lor t \approx \bot \lor C \qquad s \approx \top \lor t \approx \top \lor C}\;\textsc{NeqOuterClaus}$$

The double bars identify simplification rules (i.e., the conclusions make the premise redundant and can replace it). The first two rules require that *s* has a logical symbol as its head, whereas the last two require that *s* and *t* are Boolean terms other than ⊤ and ⊥. The function *oc* distributes the logical symbols over the clause *C*—e.g., *oc*(*s* → *t*, *C*) = {*s* ≈ ⊥ ∨ *t* ≈ ⊤ ∨ *C*} and *oc*(¬(*s* ∨ *t*), *C*) = {*s* ≈ ⊥ ∨ *C*, *t* ≈ ⊥ ∨ *C*}. It is easy to check that our redundancy criterion allows us to replace the premise of the OUTERCLAUS rules with their conclusions. Nonetheless, we apply EQOUTERCLAUS and NEQOUTERCLAUS as inferences because the premises might be useful in their original form.
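The behavior of *oc* can be sketched executably. The formula datatype and the polarity-based recursion below are our own minimal rendering for illustration, not Zipperposition's implementation; clauses are modeled as frozensets of literals (*t*, *b*), read as *t* ≈ ⊤ when *b* is true and *t* ≈ ⊥ when *b* is false:

```python
# Formulas: ('not', f) | ('and', f, g) | ('or', f, g) | ('imp', f, g) | atom.

def oc(phi, clause, pos=True):
    """Distribute the logical symbols of phi over clause, yielding a clause set."""
    op = phi[0] if isinstance(phi, tuple) else None
    if op == 'not':
        return oc(phi[1], clause, not pos)
    if op == 'imp':                      # s → t  behaves like  ¬s ∨ t
        return oc(('or', ('not', phi[1]), phi[2]), clause, pos)
    if (op == 'and' and pos) or (op == 'or' and not pos):
        # conjunctive case: split into two clauses
        return oc(phi[1], clause, pos) | oc(phi[2], clause, pos)
    if (op == 'or' and pos) or (op == 'and' and not pos):
        # disjunctive case: both subformulas end up in the same clause
        return {c2 for c1 in oc(phi[1], clause, pos)
                   for c2 in oc(phi[2], c1, pos)}
    # non-logical head: emit the literal phi ≈ ⊤ (pos) or phi ≈ ⊥ (neg)
    return {clause | {(phi, pos)}}

# Mirrors the examples in the text:
# oc(s → t, ∅)      = {s ≈ ⊥ ∨ t ≈ ⊤}
# oc(¬(s ∨ t), ∅)   = {s ≈ ⊥, t ≈ ⊥}
print(oc(('imp', 's', 't'), frozenset()))
print(oc(('not', ('or', 's', 't')), frozenset()))
```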

Besides the two delayed clausification methods, a third inprocessing clausification method is *immediate* clausification. This clausifies the input problem's outer Boolean structure in one swoop, resulting in a set of higher-order clauses. If unclausified Boolean terms rise to the top during saturation, the same algorithm is run to clausify them.

Unlike delayed clausification, immediate clausification is a black box and is unaware of the proof state other than the Boolean term it is applied to. Delayed clausification, on the other hand, clausifies the term step by step, allowing us to interleave clausification with the strong simplification machinery of superposition provers. It is especially powerful in higher-order contexts: Examples such as *y* p q ≉ (p ∨ q) can be refuted directly by equality resolution, rather than via more explosive rules on the clausified form.

#### 4 Refutational Completeness

Our calculus is dynamically refutationally complete for problems in Q≈-normal form. The full proof can be found in our technical report [8].

Theorem 3 (Dynamic refutational completeness). *Let* (*N<sub>i</sub>*)*<sub>i</sub>* *be a derivation—i.e., N<sub>i</sub>* \ *N*<sub>*i*+1</sub> ⊆ *Red*<sub>C</sub>(*N*<sub>*i*+1</sub>) *for all i. Let N*<sub>0</sub> *be* Q≈*-normal and such that N*<sub>0</sub> |= ⊥*. Moreover, assume that* (*N<sub>i</sub>*)*<sub>i</sub>* *is fair—i.e., all inferences from clauses in the limit inferior* ⋃<sub>*i*</sub> ⋂<sub>*j*≥*i*</sub> *N<sub>j</sub>* *are contained in* ⋃<sub>*i*</sub> *Red*<sub>I</sub>(*N<sub>i</sub>*)*. Then we have* ⊥ ∈ *N<sub>i</sub> for some i.*

Following the completeness proof of λSup, our proof is structured in three levels of logics. For each, we define a calculus and show that it is refutationally complete: ground monomorphic first-order logic with an interpreted Boolean type (GF); the Q≈-normal ground fragment of higher-order logic (GH); and higher-order logic (H).

The logic of the GF level is the ground fragment of oSup's logic. The GF calculus is a ground version of oSup, which Nummelin et al. showed refutationally complete. It consists of ground first-order equivalents of our rules, excluding ARGCONG, FLUIDBOOLHOIST, and FLUIDLOOBHOIST, which are specific to higher-order logic. The counterparts of FORALLHOIST and EXISTSHOIST enumerate ground terms instead of producing free variables, to stay within the ground fragment. For compatibility with the nonground level, the conclusions of FORALLRW and EXISTSRW cannot contain concrete Skolem functions. Instead, the GF calculus is parameterized by a witness function that can assign an arbitrary term to each occurrence of a quantifier in a clause. This witness function is used to retrieve the Skolem terms in the GF equivalents of FORALLRW and EXISTSRW.

On the next level, the GH calculus includes inference rules isomorphic to the GF rules, transferred to higher-order logic via *F*<sup>−1</sup>. Moreover, it contains an ARGCONG variant that enumerates ground terms instead of introducing fresh variables, as well as rules enumerating ground instances of axioms (EXT) and (CHOICE). We prove refutational completeness of the GH calculus by constructing a higher-order interpretation based on the model constructed for the completeness proof of the GF level. This proof step is analogous to the corresponding step in λSup's proof, but we must also consider Q≈-normality and the logical symbols.

To lift completeness to the H level, we use the saturation framework of Waldmann et al. [31]. The main proof obligation it leaves us to show is that nonredundant GH inferences can be lifted to corresponding nonground H inferences. For this lifting, we must choose a suitable GH witness function and appropriate GH selection functions for literals and Boolean subterms, given a saturated clause set at the H level and the H selection functions. Then the saturation framework guarantees static refutational completeness w.r.t. Herbrand entailment, which is the entailment relation induced by the grounding function *G*. We then show that this implies dynamic refutational completeness w.r.t. |= for Q≈-normal initial clause sets.

#### 5 Implementation

We implemented our calculus in the Zipperposition prover [14], whose OCaml source code makes it convenient to prototype calculus extensions. Except for the presence of axioms (EXT) and (CHOICE), the new code gracefully extends Zipperposition's implementation of oSup in the sense that oλSup coincides with oSup on first-order problems. The same cannot be said w.r.t. λSup on Boolean-free problems because of the FLUIDBOOLHOIST and FLUIDLOOBHOIST rules, which are triggered by any applied variable. From the implementation of λSup, we inherit the given clause procedure, which supports infinitely branching inferences, as well as calculus extensions and heuristics [28]. From the implementation of oSup, we inherit the simplification rule BOOLSIMP, a mainstay of our Boolean simplification machinery.
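The given clause procedure's support for infinitely branching inferences can be illustrated by a small scheduling sketch (ours, not Zipperposition's actual OCaml code): when a rule such as FLUIDSUP yields a possibly infinite stream of conclusions, fairness requires that every stream keep being revisited, which a simple round-robin over the streams achieves:

```python
import itertools
from collections import deque

def dovetail(streams):
    """Fairly interleave possibly infinite iterators of inference conclusions."""
    queue = deque(streams)
    while queue:
        stream = queue.popleft()
        try:
            yield next(stream)
            queue.append(stream)   # revisit this stream later
        except StopIteration:
            pass                   # exhausted streams are dropped

# A finite stream interleaved with an infinite one: both keep making progress.
finite = iter(['C1', 'C2'])
infinite = ('D%d' % i for i in itertools.count())
print(list(itertools.islice(dovetail([finite, infinite]), 5)))
# → ['C1', 'D0', 'C2', 'D1', 'D2']
```

Every conclusion of every stream is eventually produced, which is the property needed for the fairness assumption of Theorem 3.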

As in the implementation of λSup, we approximate fluid terms as terms that are either nonground λ-expressions or terms of the form *x s*¯*<sup>n</sup>* with *n* > 0. Two slight, accidental discrepancies are that we also count variable occurrences below quantifiers as deep and perform EFACT inferences even if the maximal literal is selected. Since we expect FLUIDBOOLHOIST and FLUIDLOOBHOIST to be highly explosive, we penalize them and all of their offspring. In addition to various λSup extensions [6, Sect. 5], we also use all the rules for Boolean reasoning described by Vukmirović and Nummelin [30] except for the BOOLEF rules.

#### 6 Evaluation

We evaluate the calculus implementation in Zipperposition and compare it with other higher-order provers. Our experiments were performed on StarExec Miami servers equipped with Intel Xeon E5-2620 v4 CPUs clocked at 2.10 GHz. We used all 2606 TH0 theorems from the TPTP 7.3.0 library [27] and 1253 "Judgment Day" problems [12] generated using Sledgehammer (SH) [24] as our benchmark set. An archive containing the benchmarks and the raw evaluation results is publicly available [5].

Calculus Evaluation. In this first part, we evaluate selected parameters of Zipperposition by varying only the studied parameter in a fixed well-performing configuration. This base configuration disables axioms (CHOICE) and (EXT) and the FLUID- rules. It uses the unification procedure of Vukmirović et al. [29] in its complete variant—i.e., the variant that produces a complete set of unifiers. It uses none of the early Boolean rules described by Vukmirović and Nummelin [30]. The preprocessor Q<sup>≈</sup> is disabled as well. All of the completeness-preserving simplification rules listed in Sect. 3 are enabled. The configuration uses immediate clausification. We set the CPU time limit to 30 s in all three experiments.

In the first experiment, we assess the overhead incurred by the FLUID- rules. These rules unify with a term whose head is a fresh variable. Thus, we expected that they would need to be tightly controlled to achieve good performance. To test our hypothesis, we simultaneously modified the parameters of these three rules. In Figure 1, the *off* mode simply disables the rules, the *pragmatic* mode uses a terminating incomplete unification algorithm (the pragmatic variant of Vukmirović et al. [29]), and the *complete* mode uses a complete unification algorithm. The results show that disabling the FLUID- rules altogether achieves the best performance. However, on TPTP problems, *complete* finds 35 proofs not found by *off*, and *pragmatic* finds 22 proofs not found by *off*. On Sledgehammer benchmarks, this effect is much weaker, likely because the Sledgehammer benchmarks require less higher-order reasoning: *complete* finds only one new proof over *off*, and *pragmatic* finds only four.

In the second experiment, we explore the clausification methods introduced at the end of Sect. 3: *inner* delayed clausification, *outer* delayed clausification, and *immediate* clausification. The modes *inner* and *outer* employ oSup's RENAME rule, which renames Boolean terms headed by logical symbols using a Tseitin-like transformation if they occur at least four times in the proof state. Vukmirović and Nummelin [30] observed that *outer* clausification can greatly help prove higher-order problems, and we expected


Fig. 1. Evaluation of FLUID- rules

Fig. 2. Evaluation of clausification method


Fig. 3. Evaluation of axiom (CHOICE)

Fig. 4. Evaluation of all competitive higher-order provers

it to perform well for our calculus, too. The results, shown in Figure 2, confirm our hypothesis: The *outer* mode outperforms *immediate* on both TPTP and Sledgehammer benchmarks. The *inner* mode performs worst, but on Sledgehammer benchmarks, it proves 17 problems beyond the reach of the other two. Interestingly, several of these problems contain axioms of the form φ → ψ, and applying superposition and demodulation to these axioms is preferable to clausifying them.

In the third experiment, we investigate the effect of axiom (CHOICE), which is necessary to achieve refutational completeness. To evaluate (CHOICE), we either disabled it in a configuration labeled *off* or set the axiom's penalty *p* to different values. In Zipperposition, penalties are propagated through inference and simplification rules and are used to increase the heuristic weight of clauses, postponing the selection of penalized clauses. The results are shown in Figure 3. As expected, disabling (CHOICE), or at least penalizing it heavily, improves performance. Yet enabling (CHOICE) can be crucial: For 19 TPTP problems, the proofs are found when (CHOICE) is enabled and *p* = 4, but not when the rule is disabled. On Sledgehammer problems, this effect is weaker, with only two new problems proved for *p* = 4.
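The penalty mechanism can be sketched as a toy model (the class and parameter names are our own, not Zipperposition's code): the penalty scales the heuristic weight used to order the passive clause queue, so penalized clauses such as instances of (CHOICE) and their offspring are selected later:

```python
import heapq

class PassiveSet:
    """Toy passive clause queue ordered by penalty-scaled heuristic weight."""
    def __init__(self):
        self._heap, self._counter = [], 0
    def add(self, clause, weight, penalty=1):
        self._counter += 1   # tie-breaker: FIFO among equal effective weights
        heapq.heappush(self._heap, (weight * penalty, self._counter, clause))
    def pop(self):
        return heapq.heappop(self._heap)[2]

passive = PassiveSet()
passive.add('light clause', weight=5)
passive.add('choice instance', weight=3, penalty=4)   # effective weight 12
print(passive.pop())   # 'light clause' is selected first despite higher weight
```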

Prover Comparison. In this second part, we compare Zipperposition's performance with that of other higher-order provers. As at CASC-J10, the wall-clock time limit was 120 s, the CPU time limit was 960 s, and the provers were run on StarExec Miami. We used the following versions of all systems that took part in the THF division: CVC4 1.8 [3], Leo-III 1.5.2 [26], Satallax 3.5 [13], and Vampire 4.5 [11]. The developers of Vampire have informed us that its higher-order schedule is optimized for running on a single core; as a result, the prover suffers some degradation of performance when running on multiple cores. We evaluate both the version of Zipperposition that took part in CASC-J10 (*Zip*) and the updated version that supports our new calculus (*New Zip*). Zip's portfolio of prover configurations is based on λSup and techniques described by Vukmirović and Nummelin [30]. New Zip's portfolio is specially designed for our new calculus and optimized for TPTP problems. To assess the performance of Boolean reasoning, we used Sledgehammer benchmarks generated both with native Booleans (SH) and with an encoding into Boolean-free higher-order logic (ofSH). For technical reasons, the encoding also performs λ-lifting, but this minor transformation should have little impact on the results [6, Sect. 7].

The results are shown in Figure 4. The two versions of Zipperposition are ahead of all other provers on both benchmark sets. This shows that, with thorough parameter tuning, higher-order superposition outperforms tableaux, which had been the state of the art in higher-order reasoning for a decade. New Zip beats Zip on TPTP problems but lags behind Zip on Sledgehammer benchmarks, as we have yet to explore more general heuristics that work well with our new calculus. The Sledgehammer benchmarks fail to demonstrate the superiority of native Boolean reasoning over an encoding; in fact, CVC4 and Leo-III perform dramatically better on the encoded Boolean problems, suggesting that there is room for tuning.

#### 7 Conclusion

We have created a superposition calculus for higher-order logic that is refutationally complete. Most of the key ideas have been developed in previous work by us and colleagues, but combining them in the right way has been challenging. A key idea was to Q≈-normalize away inconvenient terms.

Unlike earlier refutationally complete calculi for full higher-order logic based on resolution or paramodulation, our calculus employs a term order, which restricts the proof search, and a redundancy criterion, which can be used to add various simplification rules while keeping refutational completeness. These two mechanisms are undoubtedly major factors in the success of first-order superposition, and it is very fortunate that we could incorporate both in a higher-order calculus. An alternative calculus with the same two mechanisms could be achieved by combining oSup with Bhayat and Reger's combinatory superposition [10]. The article on λSup [6, Sect. 8] discusses related work in more detail.

The evaluation results show that our calculus is an excellent basis for higher-order theorem proving. In future work, we want to experiment further with the different parameters of the calculus (for example, with Boolean subterm selection heuristics) and implement it in a state-of-the-art prover such as E.

Acknowledgment. Uwe Waldmann provided advice and carefully checked the completeness proof. Visa Nummelin led the design of the oSup calculus. Simon Cruanes helped us with the implementation. Martin Desharnais generated the Sledgehammer benchmarks. Christoph Benzmüller, Ahmed Bhayat, Mathias Fleury, Herman Geuvers, Giles Reger, Alexander Steen, Mark Summerfield, Geoff Sutcliffe, and the anonymous reviewers helped us in various ways. We thank them all.

Bentkamp, Blanchette, and Vukmirović's research has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 713999, Matryoshka). Blanchette's research has received funding from the Netherlands Organization for Scientific Research (NWO) under the Vidi program (project No. 016.Vidi.189.037, Lean Forward).

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Implementation and Application**

# **Making Higher-Order Superposition Work**

Petar Vukmirović<sup>1</sup>, Alexander Bentkamp<sup>1</sup>, Jasmin Blanchette<sup>1,2,3</sup>, Simon Cruanes<sup>4</sup>, Visa Nummelin<sup>1</sup>, and Sophie Tourret<sup>2,3</sup>

<sup>1</sup> Vrije Universiteit Amsterdam, Amsterdam, the Netherlands {p.vukmirovic,a.bentkamp,j.c.blanchette,visa.nummelin}@vu.nl <sup>2</sup> Université de Lorraine, CNRS, Inria, LORIA, Nancy, France sophie.tourret@inria.fr <sup>3</sup> Max-Planck-Institut für Informatik, Saarbrücken, Germany <sup>4</sup> Aesthetic Integration, Austin, Texas, USA simon@imandra.ai

**Abstract.** Superposition is among the most successful calculi for first-order logic. Its extension to higher-order logic introduces new challenges such as infinitely branching inference rules, new possibilities such as reasoning about formulas, and the need to curb the explosion of specific higher-order rules. We describe techniques that address these issues and extensively evaluate their implementation in the Zipperposition theorem prover. Largely thanks to their use, Zipperposition won the higher-order division of the CASC-J10 competition.

#### **1 Introduction**

In recent decades, superposition-based first-order automatic theorem provers have emerged as useful reasoning tools. They dominate at the annual CASC [45] theorem prover competitions, having always won the first-order theorem division. They are also used as backends to proof assistants [13, 25, 35], automatic higher-order theorem provers [42], and software verifiers [17]. The superposition calculus has only recently been extended to higher-order logic, resulting in λ-superposition [6], which we developed together with Waldmann, as well as combinatory superposition [10] by Bhayat and Reger.

Both higher-order superposition calculi were designed to gracefully extend first-order reasoning. As most steps in higher-order proofs tend to be essentially first-order, extending the most successful first-order calculus to higher-order logic seemed worth trying. Our first attempt at corroborating this conjecture was in 2019: Zipperposition 1.5, based on λ-superposition, finished third in the higher-order theorem division of CASC-27 [47], 12 percentage points behind the winner, the tableau prover Satallax 3.4 [11].

Studying the competition results, we discovered that higher-order tableaux have some advantages over higher-order superposition. To bridge the gap, we developed techniques and heuristics that simulate the behavior of a tableau prover in the context of saturation. We implemented them in Zipperposition 2, which took part in CASC-J10 in 2020. This time, Zipperposition won the division, solving 84% of problems, a whole 20 percentage points ahead of the next best prover, Satallax 3.4. In this paper, we describe the main techniques that explain this reversal of fortunes. They range from preprocessing to backend integration.

Interesting patterns can be observed in various higher-order encodings of problems. We show how we can exploit these to simplify problems (Sect. 3). By working on formulas rather than clauses, tableau techniques take a more holistic view of a higher-order problem. Delaying the clausification through the use of calculus rules that act on formulas achieves the same effect in superposition. We further explore the benefits of this approach (Sect. 4).

The main drawback of λ-superposition compared with combinatory superposition is that it relies on rules that enumerate possibly infinite sets of unifiers. We describe a mechanism that interleaves performing infinitely branching inferences with the standard saturation process (Sect. 5). The prover retains the same behavior as before on first-order problems, smoothly scaling with increasing numbers of higher-order clauses. We also propose some heuristics to curb the explosion induced by highly prolific λ-superposition rules (Sect. 6).

Using first-order backends to finish the proof is common practice in higher-order reasoning. Since λ-superposition coincides with standard superposition on first-order clauses, invoking backends may seem redundant; yet Zipperposition is nowhere as efficient as E [38] or Vampire [28], so invoking a more efficient backend does make sense. We describe how to achieve a balance between allowing native higher-order reasoning and delegating reasoning to a backend (Sect. 7).

Finally, we compare Zipperposition 2 with other provers on all monomorphic higher-order TPTP benchmarks [46] to perform a more extensive evaluation than at CASC (Sect. 8). Our evaluation corroborates the competition results.

# **2 Background and Setting**

We focus on monomorphic higher-order logic, but the techniques can easily be extended with polymorphism. Indeed, Zipperposition already supports some techniques polymorphically.

**Higher-Order Logic.** We define terms s, t, u, v inductively as free variables F, X, bound variables x, y, z, ..., constants f, g, a, b, ..., applications s t, and λ-abstractions λx. s. The syntactic distinction between free and bound variables gives rise to loose bound variables (e.g., y in λx. y a) [32]. We let s t̄<sub>n</sub> stand for s t<sub>1</sub> ... t<sub>n</sub> and λx̄<sub>n</sub>. s for λx<sub>1</sub>. ... λx<sub>n</sub>. s. Every β-normal term can be written as λx̄<sub>m</sub>. s t̄<sub>n</sub>, where s is not an application; we call s the head of the term. If the type of a term t is of the form τ<sub>1</sub> → ··· → τ<sub>n</sub> → o, where o is the distinguished Boolean type and n ≥ 0, we call t a predicate. A literal l is an equation s ≈ t or a disequation s ≉ t. A clause is a finite multiset of literals, interpreted and written disjunctively as l<sub>1</sub> ∨ ··· ∨ l<sub>n</sub>. Logical symbols that may occur within terms are written in boldface: **¬**, **∧**, **∨**, **→**, **↔**, .... Predicate literals are encoded as (dis)equations with **⊤** based on their sign; for example, even(x) becomes even(x) ≈ **⊤**, and **¬** even(x) becomes even(x) ≉ **⊤**.
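As an illustration of this syntax (not Zipperposition's actual data structures), β-normal terms and the head function can be sketched as follows, representing bound variables by de Bruijn indices; all class and function names here are hypothetical:

```python
from dataclasses import dataclass

# Minimal sketch of higher-order term syntax: constants, free variables,
# bound variables (de Bruijn indices), applications, and λ-abstractions.
@dataclass(frozen=True)
class Const:
    name: str

@dataclass(frozen=True)
class Free:
    name: str   # free variable F, X, ...

@dataclass(frozen=True)
class Bound:
    index: int  # de Bruijn index; a "loose" index points above all enclosing λs

@dataclass(frozen=True)
class App:
    fn: object
    arg: object

@dataclass(frozen=True)
class Lam:
    body: object

def head(t):
    """Head of a β-normal term λx̄_m. s t̄_n: strip the λs, then the arguments."""
    while isinstance(t, Lam):
        t = t.body
    while isinstance(t, App):
        t = t.fn
    return t
```

For instance, the head of λx. f x Y is the constant f.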

**Higher-Order Calculi.** The λ-superposition calculus is a refutationally complete inference system and redundancy criterion for Boolean-free extensional polymorphic clausal higher-order logic. The calculus relies on complete sets of unifiers (CSUs). The CSU for s and t with respect to a set of variables V, denoted by CSU<sub>V</sub>(s, t), is a set of unifiers such that for any unifier ϱ of s and t, there exist substitutions σ ∈ CSU<sub>V</sub>(s, t) and θ such that ϱ(X) = θ(σ(X)) for all variables X ∈ V. The set V is used to distinguish between important and auxiliary variables. We usually omit it. A pragmatic, incomplete extension of λ-superposition with interpreted Booleans is described by Vukmirović and Nummelin [51]. This forms the basis of most of this work. Recently, a refutationally complete extension was developed by Bentkamp et al. [5]; it is not considered here.

By contrast, the combinatory superposition calculus avoids CSUs by using a form of first-order unification, but essentially it enumerates higher-order terms using rules that instantiate applied variables with partially applied combinators from the complete combinator set {**S**, **K**, **B**, **C**, **I**}. This calculus is the basis of Vampire 4.5 [10], which finished closely behind Satallax 3.4 at CASC-J10.

A different, very successful calculus is Satallax's SAT-guided tableaux [2]. Satallax was the leading higher-order prover of the 2010s. Its simple and elegant tableaux avoid deep superposition-style rewriting inferences. Nevertheless, our working hypothesis for the past six years has been that superposition would likely provide a stronger basis for higher-order reasoning. Other competing higher-order calculi include SMT (implemented in CVC4 [3, 4]) and extensional paramodulation (implemented in Leo-III [42]).

**Zipperposition.** Zipperposition [6, 12] is a higher-order theorem prover based on a pragmatic extension of λ-superposition. It was conceived as a testbed for rapidly experimenting with extensions of first-order superposition, but over time, it has assimilated many of E's techniques and heuristics. Zipperposition 2 also implements combinatory superposition.

Several of our techniques extend the given clause procedure [30, Section 2.3], the standard saturation procedure. It partitions the proof state into a set P of passive clauses and a set A of active clauses. Initially, P contains all input clauses, and A is empty. At each iteration, a given clause C from P is moved to A (i.e., it is activated), all inferences between C and clauses in A are performed, and the conclusions are added to P. Because Zipperposition fully simplifies clauses only when they are activated, it implements a DISCOUNT-style loop [14].
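The given clause procedure with DISCOUNT-style simplification can be sketched schematically as follows; the propositional resolution used to exercise it is only a stand-in for superposition inferences, and redundancy elimination is omitted:

```python
def given_clause_loop(input_clauses, inferences, simplify, pick):
    """Schematic DISCOUNT-style given clause procedure: clauses are fully
    simplified only when they are activated (moved from P to A)."""
    passive = list(input_clauses)        # P: initially all input clauses
    active = []                          # A: initially empty
    while passive:
        given = simplify(pick(passive), active)   # simplify at activation only
        if given == frozenset():                  # empty clause derived
            return "Unsatisfiable"
        active.append(given)
        # perform all inferences between `given` and clauses in A;
        # conclusions are added to P
        passive.extend(inferences(given, active))
    return "Satisfiable"                          # saturated without ⊥

# Stand-in inference system: binary resolution on sets of integer literals,
# where the negation of literal n is -n.
def resolve(c1, c2):
    return [(c1 - {l}) | (c2 - {-l}) for l in c1 if -l in c2]
```

On the clause set {1}, {¬1} the loop derives the empty clause and reports unsatisfiability; on {1}, {2} it saturates.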

**Experimental Setup.** To assess our techniques, we carried out experiments with Zipperposition 2. We used all 2606 monomorphic higher-order problems from the TPTP library [46], version 7.2.0, as benchmarks. Although some techniques support polymorphism, we uniformly used the monomorphic benchmarks. We fixed a base configuration of Zipperposition parameters as a baseline for all comparisons. Then, in each experiment, we varied the parameters associated with a specific technique to evaluate it. The experiments were run on StarExec [43] servers, equipped with Intel Xeon E5-2609 CPUs clocked at 2.40 GHz. Unless otherwise stated, we used a CPU time limit of 20 s, roughly the time each configuration is given in the portfolio mode used for CASC. The raw evaluation results are available online.<sup>5</sup>

# **3 Preprocessing Higher-Order Problems**

The TPTP library contains thousands of higher-order problems. Despite their diversity, they have a markedly different flavor from the TPTP first-order problems. Notably, they extensively use the definition role to identify universally quantified equations (or equivalences) that define symbols.

Definitions can be replaced by rewrite rules, using the orientation given in the input problem. If there are multiple definitions for the same symbol, only the first one is replaced by a rewrite rule. Then, whenever a clause is picked in the given clause procedure, it will be rewritten using the collected rules. Since the TPTP format enforces no constraints on definitions, rewriting might diverge. To ensure termination, we limit the number of applied rewrite steps. In practice, most TPTP problems are well behaved: Only one definition is given for each symbol, and the definitions are acyclic. Instead of rewriting a clause when it is activated, we can rewrite the input formulas as a preprocessing step. This ensures that the input clauses will be fully simplified when the proving process starts and no defined symbols will occur in clauses, which usually helps the heuristics.
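The bounded rewriting can be sketched as follows, with terms as nested tuples and a global step budget; the representation and names are illustrative, not Zipperposition's:

```python
def rewrite(t, defs, budget):
    """Rewrite `t` outermost-first with the definition rules in `defs`
    (a map from defined symbol to a function building the right-hand side),
    stopping when no rule applies or the step budget is exhausted.
    Returns the rewritten term and the remaining budget."""
    while budget > 0 and isinstance(t, tuple) and t[0] in defs:
        t = defs[t[0]](*t[1:])   # unfold one definition step at the root
        budget -= 1
    if isinstance(t, tuple):     # then recurse into the arguments
        args = []
        for a in t[1:]:
            a, budget = rewrite(a, defs, budget)
            args.append(a)
        t = (t[0],) + tuple(args)
    return t, budget
```

With double(x) → plus(x, x), the term double(a) normalizes in one step, whereas a cyclic definition such as loop(x) → loop(x) merely exhausts the budget instead of diverging.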

Eagerly unfolding the definitions and β-reducing can eliminate all of a problem's higher-order features, making it amenable to first-order methods. However, this can inflate the problem beyond recognition and compromise the refutational completeness of superposition.

To keep completeness, we can try to orient the definitions using the term order that parameterizes superposition and rely on demodulation to simplify the proof state. Usually, the Knuth–Bendix order (KBO) [26] is used. It compares terms by first comparing their weights, which are the sums of the weights assigned to the symbols they contain. Given a symbol weight assignment W, we can update it so that it orients acyclic definitions from left to right, assuming that they are of the form f X̄<sub>m</sub> ≈ λȳ<sub>n</sub>. t, where the only free variables in t are X̄<sub>m</sub>, no free variable repeats or appears applied in t, and f does not occur in t. Then we traverse the symbols f that are defined by such equations following the dependency relation, starting with symbols that do not depend on any other defined symbol. For each f, we set W(f) to w + 1, where w is the maximum weight of the right-hand sides of f's definitions, computed using W. By construction, each equation's left-hand side is heavier. Thus, the equations are orientable from left to right.
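This weight assignment can be sketched under simplifying assumptions (terms as nested tuples, default symbol weight 1, acyclic definitions); the helper names and defaults are illustrative:

```python
def symbols(t):
    """Yield all symbol/variable leaves of a term (a nested tuple)."""
    if isinstance(t, str):
        yield t
    else:
        for a in t:
            yield from symbols(a)

def adjust_weights(defs, W=None):
    """defs maps each defined symbol f to the right-hand sides of its
    definitions. Symbols are visited in dependency order (assuming acyclic
    definitions), and W(f) is set to 1 plus the maximum right-hand-side
    weight, so every definition becomes heavier on the left."""
    W = dict(W or {})
    done = set()
    def weight(t):
        if isinstance(t, str):
            return W.get(t, 1)     # default weight 1 for undefined symbols
        return sum(weight(a) for a in t)
    def visit(f):
        if f in done:
            return
        done.add(f)
        for rhs in defs[f]:
            for g in symbols(rhs):
                if g in defs:
                    visit(g)       # compute the weights of dependencies first
        W[f] = 1 + max(weight(rhs) for rhs in defs[f])
    for f in defs:
        visit(f)
    return W
```

For example, with f ≈ a and g ≈ f b, this yields W(f) = 2 and W(g) = 4, making both definitions heavier on the left.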

**Evaluation and Discussion.** The base configuration treats axioms annotated with definition as rewrite rules, and it preprocesses the formulas using the rewrite rules. We also tested the effects of disabling this preprocessing (−preprocess), disabling the special treatment of definition axioms (−RW), and disabling the special treatment of definition while using adjusted KBO weights as described above (−RW+KBO). The results are given in Figure 1. In all of the figures in this paper, each cell gives the number of proved problems; the highest number is typeset in bold. Clearly, treating definition axioms as rewrite rules greatly improves performance. Using adjusted KBO weights is not as strong, although it proves 15 problems not proved using other configurations.

<sup>5</sup> https://doi.org/10.5281/zenodo.4534829

Fig. 1: Effect of the definition rewriting methods

Fig. 2: Effect of clausification and lightweight AVATAR

#### **4 Reasoning about Formulas**

Higher-order logic identifies terms and formulas. To prove a problem, we often need to instantiate a variable with the right predicate. Finding this predicate can be easier if the problem is not clausified. Consider the conjecture ∃f. f p q ↔ p ∧ q. Expressed in this form, the formula is easy to prove by taking f := λx y. x ∧ y. By contrast, guessing the right instantiation for the negated, clausified form F p q ≉ **⊤** ∨ p ≉ **⊤** ∨ q ≉ **⊤**, F p q ≈ **⊤** ∨ p ≈ **⊤**, F p q ≈ **⊤** ∨ q ≈ **⊤** is more challenging. One of the strengths of higher-order tableau provers is that they do not clausify the input problem. This might explain Satallax's dominance in the THF division of CASC competitions until CASC-J10.

We studied techniques to incrementally clausify formulas during proof search in incomplete [51] and complete [5] extensions of λ-superposition. Both approaches include the same set of (outer) delayed clausification rules that clausify top-level logical symbols, proceeding outside in; for example, a clause C ∨ (p **∧** q) ≉ **⊤** is transformed into C ∨ p ≉ **⊤** ∨ q ≉ **⊤**. The complete approach requires additional inference rules; it also supports inner delayed clausification. We focus on the pragmatic, incomplete approach and do not consider inner clausification due to its poor performance [5].
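A single outer clausification step on the equational encoding can be sketched as follows, with formulas as nested tuples and a literal (φ, pol) standing for φ ≈ ⊤ (pol true) or φ ≉ ⊤ (pol false); only the two ∧ cases are shown, and the encoding is illustrative:

```python
def outer_clausify_step(clause):
    """Apply one outer delayed clausification step to a clause (a frozenset
    of literals), returning the list of resulting clauses."""
    for lit in clause:
        formula, pol = lit
        if isinstance(formula, tuple) and formula[0] == "∧":
            rest = clause - {lit}
            p, q = formula[1], formula[2]
            if pol:   # C ∨ (p ∧ q) ≈ ⊤  ⟹  C ∨ p ≈ ⊤  and  C ∨ q ≈ ⊤
                return [rest | {(p, True)}, rest | {(q, True)}]
            else:     # C ∨ (p ∧ q) ≉ ⊤  ⟹  C ∨ p ≉ ⊤ ∨ q ≉ ⊤
                return [rest | {(p, False), (q, False)}]
    return [clause]   # no top-level logical symbol left to clausify
```

On the example from the text, C ∨ (p ∧ q) ≉ ⊤ yields a single clause with two negative literals, whereas the positive case splits into two clauses.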

Delayed clausification rules can be used as inference rules (which add conclusions to the passive set) or as simplification rules (which delete premises and add conclusions to the passive set). Inferences are more flexible because they produce all intermediate clausification states, whereas simplifications produce fewer clauses. Since clausifying equivalences can destroy a lot of syntactic structure [18], we never apply simplifying clausification rules on them.

We discuss two tableau-inspired approaches for reasoning about formulas. First, we study how clause-splitting techniques interfere with delayed clausification. Second, we discuss heuristic instantiation of quantifiers during saturation.

Zipperposition supports a lightweight variant of AVATAR [49], an architecture that partitions the search space by splitting clauses into variable-disjoint subclauses. This variant of AVATAR is described by Ebner et al. [15]. Combining lightweight AVATAR and delayed clausification makes it possible to split a clause (ϕ<sub>1</sub> ∨ ··· ∨ ϕ<sub>n</sub>) ≈ **⊤**, where the ϕ<sub>i</sub>'s are arbitrarily complex formulas that share no free variables with each other, into clauses ϕ<sub>i</sub> ≈ **⊤**.

To finish the proof, it suffices to derive ⊥ under each assumption ϕ<sub>i</sub> ≈ **⊤**. Since the split is performed at the formula level, this technique resembles tableaux, but it exploits the strengths of superposition, such as its powerful redundancy criterion and simplification machinery, to close the branches.

Interleaving clausification and saturation allows us to simulate another tableau technique. Whenever dynamic clausification replaces the quantified variable x in a clause of the form (∀x. ϕ) ≈ **⊤** ∨ C with a fresh free variable X, resulting in ϕ{x → X} ≈ **⊤** ∨ C, we can create additional clauses in which x is replaced with t ∈ Inst, where Inst is a set of heuristically chosen terms. This set contains λ-abstractions whose bodies are formulas and which occur in activated clauses, as well as primitive instantiations [51], that is, imitations (in the sense of higher-order unification) of logical symbols that approximate the shape of a predicate that can instantiate a predicate variable.

However, as a new term t can be added to Inst after a clause with a quantified variable of the same type as t has been activated, we must also keep track of the clauses ϕ{x → X} ≈ **⊤** ∨ C, so that when Inst is extended, we instantiate the saved clauses. Conveniently, instantiated clauses are not recognized as subsumed, since Zipperposition uses an optimized but incomplete subsumption algorithm.

Given a disequation f s̄<sub>n</sub> ≉ f t̄<sub>n</sub>, the abstraction of s<sub>i</sub> is λx. u ≈ v, where u is obtained by replacing s<sub>i</sub> with x in f s̄<sub>n</sub> and v is obtained by replacing s<sub>i</sub> with x in f t̄<sub>n</sub>. For an equation f s̄<sub>n</sub> ≈ f t̄<sub>n</sub>, the analogous abstraction is λx. **¬** (u ≈ v).
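The abstraction operation can be sketched as follows, with terms as nested tuples; the literal representation and function names are illustrative:

```python
def replace(t, target, var):
    """Replace every occurrence of `target` in term `t` by `var`."""
    if t == target:
        return var
    if isinstance(t, tuple):
        return tuple(replace(a, target, var) for a in t)
    return t

def abstract(lhs, rhs, s_i, negative):
    """Abstraction of argument s_i from the literal lhs ≉ rhs (negative=True)
    or lhs ≈ rhs (negative=False); the polarity is inverted, as in the text."""
    u = replace(lhs, s_i, "x")
    v = replace(rhs, s_i, "x")
    body = ("≈", u, v)
    if not negative:              # equation gives a negated equation under λ
        body = ("¬", body)
    return ("λx", body)
```

Applied to the DAT056^2 disequation, abstracting xs out of ap xs (ap ys zs) ≉ ap (ap xs ys) zs yields λx. ap x (ap ys zs) ≈ ap (ap x ys) zs.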

Adding abstractions of the conjecture literals to Inst can provide useful instantiations for formulas such as induction principles for datatypes. As the conjecture is negated, the equation's polarity is inverted in the abstraction. Consider the TPTP problem DAT056^2 [44], whose clausified negated conjecture is ap xs (ap ys zs) ≉ ap (ap xs ys) zs, where ap is the append operator defined recursively on its first argument and xs, ys, and zs are of list type. Abstracting xs from the disequation yields t = λx. ap x (ap ys zs) ≈ ap (ap x ys) zs, which is added to Inst. Included in the problem is the induction axiom for the list datatype: ∀p. (p nil **∧** (∀x xs. p xs **→** p (cons x xs))) **→** ∀xs. p xs, where nil and cons have the usual meanings. Instantiating p with t and using the ap definition, we can prove ∀x. ap x (ap ys zs) ≈ ap (ap x ys) zs, from which we easily derive a contradiction.

**Evaluation and Discussion.** The base configuration uses immediate clausification (IC), an approach that applies a standard clausification algorithm [33] both as a preprocessing step and whenever predicate variables are instantiated. Zipperposition's lightweight AVATAR is disabled in the base configuration. To test the merits of delayed clausification, we vary base's parameters along two axes: We choose immediate clausification (IC), delayed clausification as inference (DCI), or delayed clausification as simplification (DCS), and we either enable (+LA) or disable (−LA) the lightweight AVATAR. The base configuration does not use instantiation with terms from Inst.

Figure 2 shows that using delayed clausification as simplification greatly increases the success rate, while using delayed clausification as inference has the opposite effect. Manually inspecting the proofs found by the DCS configuration, we noticed that a main reason for its success is that it does not simplify away equivalences. Overall, the lightweight AVATAR harms performance, but the sets of problems proved with and without it are vastly different. For example, the IC+LA configuration proves 60 problems not proved by IC−LA.

The Boolean instantiation technique presented above requires delayed clausification. To test its effects, we enabled it in the best configuration from Figure 2, DCS−LA. With this change, Zipperposition proves 1744 problems, 36 of which cannot be proved by any other configuration in the same figure. Boolean instantiation is the only way in which Zipperposition 2 can prove higher-order problems requiring reasoning about induction axioms (e.g., DAT056^2).

#### **5 Enumerating Infinitely Branching Inferences**

As an optimization and to simplify the implementation, Leo-III [40] and Vampire 4.4 [9] (which uses a predecessor of combinatory superposition) compute only a finite subset of the possible conclusions for inferences that require enumerating a CSU. Not only is this a source of incompleteness, but choosing the cardinality of the computed subset is a difficult heuristic choice. Small sets can result in missing the unifier necessary for the proof, whereas large sets make the prover spend a long time in the unification procedure, generate useless clauses, and possibly get sidetracked into the wrong parts of the search space.

We propose a modification to the given clause procedure to seamlessly interleave unifier computation and proof state exploration. Given a complete unification procedure, which may yield infinite streams of unifiers, our modification fairly enumerates all conclusions of inferences relying on elements of a CSU. Under some reasonable assumptions, it behaves exactly like the standard given clause procedure on purely first-order problems. We also describe heuristics that help achieve a similar performance as when using incomplete, terminating unification procedures without sacrificing completeness.

Since it is undecidable whether a stream of unifiers contains a next CSU element, a request for the next conclusion might not terminate, effectively bringing the theorem prover to a halt. Our modified given clause procedure expects the unification procedure to return a lazily computed stream [34, Sect. 4.2], each element of which is either ∅ or a singleton set containing a unifier. To avoid getting stuck waiting for a unifier that may not exist, the unification procedure should return ∅ after it has performed a given number of operations without finding a unifier.

The complete unification procedure by Vukmirović et al. [52] returns such a stream. Other procedures, such as Huet's [22] and Jensen and Pietrzykowski's [23], can easily be adapted to meet this requirement. Based on the stream of unifiers interspersed with ∅, we can construct a stream of inferences, similarly interspersed with ∅, any finite prefix of which can be computed in finite time.

To support such streams in the given clause procedure, we extend it to represent the proof state not only by the active (A) and passive (P) clause sets, but also by a priority queue Q containing the inference streams. Each stream is associated with a weight, and Q is sorted in order of increasing weight. Elsewhere [6], Bentkamp et al. described an older version of this extension. Here we present a newer version in more detail, including heuristics to postpone unpromising streams. The pseudocode of the modified procedure is as follows:

```
function ExtractClause(Q, stream)
  maybe_clause ← pop and compute the first element of stream
  if stream is not empty then add stream to Q with an increased weight
  return maybe_clause

function HeuristicProbe(Q)
  (collected_clauses, i) ← (∅, 0)
  while i < Kbest and Q is not empty do
    (maybe_clause, j) ← (∅, 0)
    while j < Kretry and Q is not empty and maybe_clause = ∅ do
      stream ← pop the lowest-weight stream in Q
      maybe_clause ← ExtractClause(Q, stream)
      j ← j + 1
    collected_clauses ← collected_clauses ∪ maybe_clause
    i ← i + 1
  return collected_clauses

function FairProbe(Q, num_oldest)
  collected_clauses ← ∅
  oldest_streams ← pop num_oldest oldest streams from Q
  for stream in oldest_streams do
    collected_clauses ← collected_clauses ∪ ExtractClause(Q, stream)
  return collected_clauses

function ForceProbe(Q)
  collected_clauses ← ∅
  while Q is not empty and collected_clauses = ∅ do
    collected_clauses ← FairProbe(Q, |Q|)
  if Q and collected_clauses are empty then status ← Satisfiable
  else status ← Unknown
  return (status, collected_clauses)

function GivenClause(P, A, Q)
  (status, i) ← (Unknown, 0)
  while status = Unknown do
    if P is not empty then
      given ← pop a chosen clause from P and simplify it
      if given is the empty clause then status ← Unsatisfiable
      else
        A ← A ∪ {given}
        for stream in streams of inferences between given and other ∈ A do
          if stream is not empty then P ← P ∪ ExtractClause(Q, stream)
        i ← i + 1
        if i mod Kfair = 0 then P ← P ∪ FairProbe(Q, i / Kfair)
        else P ← P ∪ HeuristicProbe(Q)
    else
      (status, forced_clauses) ← ForceProbe(Q)
      P ← P ∪ forced_clauses
  return status
```

Initially, all input clauses are put into P, and A and Q are empty. Unlike in the standard given clause procedure, inference results are represented as clause streams. The first element is inserted into P, and the rest of the stream is stored in Q with some positive integer weight computed from the inference rule.

To eventually consider inference conclusions from streams in Q as given clauses, we extract elements from, or probe, streams and move any obtained clauses to P. Analogously to the traditional pick–given ratio [30, 37], we use a parameter Kfair (by default, Kfair = 70) to ensure fairness: Every Kfair-th iteration, FairProbe probes an increasing number of the oldest streams, which achieves dovetailing. In all other iterations, HeuristicProbe attempts to extract up to Kbest clauses from the most promising streams (by default, Kbest = 7). In each attempt, the most promising stream in Q is chosen. If its first element is ∅, the rest of the stream is inserted into Q, and a new stream is chosen. This is repeated until either Kretry occurrences of ∅ have been encountered (by default, Kretry = 20) or the stream yields a singleton set. Setting Kretry > 0 increases the chance that HeuristicProbe will return Kbest clauses, as desired. Finally, if P becomes empty, ForceProbe searches relentlessly for a clause in Q, as a fallback.

The function ExtractClause extracts an element from a nonempty stream not in Q and inserts the remaining stream into Q with an increased weight, calculated as follows. Let n be the number of times the stream was chosen for probing. If probing results in <sup>∅</sup>, the stream's weight is increased by max {2, n <sup>−</sup> <sup>16</sup>}. If probing results in a clause C whose penalty is p, the stream's weight is increased by <sup>p</sup> · max {1, n <sup>−</sup> <sup>64</sup>}. The penalty of a clause is a number assigned by Zipperposition based on features such as the depth of its derivation and the rules used in it. The constants 16 and 64 increase the chance that newer streams are picked, which is desirable because their first clauses are expected to be useful.
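The increments can be stated compactly; the following directly transcribes the formulas above, where n is the number of times the stream has been probed:

```python
def weight_increase(n, clause_penalty=None):
    """Weight increase for a stream after its n-th probe: clause_penalty is
    None if the probe produced ∅, otherwise the penalty p of the clause."""
    if clause_penalty is None:
        return max(2, n - 16)               # probe yielded ∅
    return clause_penalty * max(1, n - 64)  # probe yielded a clause
```

The offsets 16 and 64 keep the increase small for young streams, so their first conclusions are explored early.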

All three probing functions are invoked by GivenClause, which forms the body of the saturation loop. It differs from the standard given clause procedure in three ways: First, the proof state includes Q in addition to P and A. Second, new inferences involving the given clause are added to Q instead of being performed immediately. Third, inferences in Q are periodically performed lazily to fill P.

GivenClause eagerly stores the first element of a new inference stream in P to imitate the standard given clause procedure. If the underlying unification procedure behaves like the standard first-order unification algorithm on higher-order logic's first-order fragment, our given clause procedure coincides with the standard one. The unification procedure by Vukmirović et al. terminates on the first-order and other fragments [32], and for problems outside these fragments, it immediately returns ∅ to avoid computing complicated unifiers eagerly.

**Evaluation and Discussion.** When the unification procedure of Vukmirović et al. was implemented in Zipperposition, it was observed that Zipperposition is the only competing higher-order prover that proves all Church numeral problems from the TPTP, never spending more than 5 s on any such problem [52].

Consider the TPTP problem NUM800^1, which requires finding a function F such that F c<sub>1</sub> c<sub>2</sub> ≈ c<sub>2</sub> **∧** F c<sub>2</sub> c<sub>3</sub> ≈ c<sub>6</sub>, where c<sub>n</sub> abbreviates the Church numeral for n, λs z. s<sup>n</sup> z. To prove it, it suffices to take F to be the multiplication operator λx y s z. x (y s) z. However, this unifier is only one of the many available for each occurrence of F.

In an independent evaluation on the same set of 2606 problems used in this paper, Vukmirović et al. compared a complete, nonterminating variant and a pragmatic, terminating variant of the unification procedure [52, Sect. 7]. The pragmatic variant was used directly: all the inference conclusions were put immediately into P, bypassing Q. The complete variant, which relies on possibly infinite streams and is much more prolific, proved only 15 problems fewer than the most competitive pragmatic variant. Furthermore, it proved 19 problems not proved by the pragmatic variant. This shows that our given clause procedure, with its heuristics, allows the prover to defer exploring less promising branches of the unification search while still using the full power of a complete higher-order unifier search to solve unification problems that cannot be solved by an incomplete procedure.

Among the competing higher-order theorem provers, only Satallax uses infinitely branching calculus rules. It maintains a queue of "commands" that contain instructions on how to create a successor state in the tableau. One command describes the infinite enumeration of all closed terms of a given function type. Unlike the evaluation of streams representing CSU elements, each execution of this command is guaranteed to make progress in enumerating the next closed functional term, so there is no need to ever return ∅.

## **6 Controlling Prolific Rules**

To support higher-order features such as function extensionality and quantification over functions, many refutationally complete calculi employ highly prolific rules. For example, λ-superposition uses a rule FluidSup [6] that very often applies to two clauses if one of them contains a term of the form F s̄<sub>n</sub>, where n > 0. We describe three mechanisms to keep rules like these under control.

First, we limit applicability of the prolific rules. In practice, it often suffices to apply prolific higher-order rules only to initial or shallow clauses—clauses with a shallow derivation depth. Thus, we added an option to forbid the application of a rule if the derivation depth of any premise exceeds a limit.

Second, we penalize the streams of expensive inferences. The weight of each stream is given an initial value based on characteristics of the inference premises such as their derivation depth. For prolific rules such as FluidSup, we increment this value by a parameter Kincr. Weights for less prolific variants of this rule, such as DupSup [6], are increased by a fraction of <sup>K</sup>incr (e.g., Kincr/3).

Third, we defer the selection of prolific clauses. To select the given clause, most saturating provers evaluate clauses according to some criteria and select the clause with the lowest evaluation. For this choice to be efficient, passive clauses are organized into a priority queue ordered by their evaluations. Like E, Zipperposition maintains multiple queues, ordered by different evaluations, that are visited in a round-robin fashion. It also uses E's two-layer evaluation functions, a variant of which has recently been implemented in Vampire [19]. The two layers are clause priority and clause weight. Clauses with higher priority are preferred, and the weight is used for tie-breaking. Intuitively, the first layer crudely separates clauses into priority classes, whereas the second one uses heuristic weights to prefer clauses within a priority class. To control the selection of prolific clauses, we introduce new clause priority functions that take into account features specific to higher-order clauses.
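The two-layer, multi-queue selection can be sketched as follows; the priority and weight functions passed in are placeholders for functions like PL or CP, and the sketch omits skipping clauses already picked from another queue:

```python
import heapq
from itertools import count

class ClauseQueues:
    """Several clause queues visited round-robin; each orders clauses by
    (priority, weight): higher priority first, lower weight breaks ties."""
    def __init__(self, evaluations):
        self.evaluations = evaluations          # list of (priority_fn, weight_fn)
        self.queues = [[] for _ in evaluations]
        self.next_queue = 0
        self.tiebreak = count()                 # FIFO among equal evaluations

    def add(self, clause):
        for q, (prio, weight) in zip(self.queues, self.evaluations):
            # negate the priority: Python's heapq is a min-heap
            heapq.heappush(q, (-prio(clause), weight(clause),
                               next(self.tiebreak), clause))

    def pick(self):
        # visit the queues in a round-robin fashion, as in E and Zipperposition
        q = self.queues[self.next_queue]
        self.next_queue = (self.next_queue + 1) % len(self.queues)
        return heapq.heappop(q)[-1]
```

With one PL-like queue (prefer clauses containing a λ) and one constant-priority queue (small clauses first), the λ-clause is picked first, then the smallest remaining clause.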

The first new priority function PreferHOSteps (PHOS) assigns a higher priority if rules specific to λ- or combinatory superposition were used in the clause derivation. Since most of the other clause priority functions tend to defer higherorder clauses, having a clause queue that prefers the results of higher-order inferences might be necessary to find a proof more efficiently. A simpler function, which prefers clauses containing λ-abstractions, is PreferLambda (PL).

We also introduce the priority function ByNormalizationFactor (BNF), inspired by the observation that a higher-order inference that applies a complicated substitution to a clause is usually followed by a βη-normalization step. If βη-normalization greatly reduces the size of a clause, it is likely that this substitution simplifies the clause (e.g., by removing a variable's arguments). Thus, this function prefers clauses that were produced by βη-normalization, and among those it prefers the ones with larger size reductions.

Another new priority function is PreferShallowAppVars (PSAV). This prefers clauses with lower depths of the deepest occurrence of an applied variable—that is, C[X a] is preferred over C[f (X a)]. This function tries to curb the explosion of both λ- and combinatory superposition: Applying a substitution to a top-level applied variable often reduces this applied variable to a term with a constant head, which likely results in a less explosive clause. Among the functions that rely on properties of applied variables we implemented PreferDeepAppVars (PDAV), which returns the priority opposite of PSAV, and ByAppVarNum (BAVN), which prefers clauses with fewer occurrences of applied variables.
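The PSAV criterion can be illustrated on a toy term representation. Terms are modeled here as `(head, arg1, ..., argn)` tuples, with the convention (ours, not Zipperposition's) that variable heads start with an uppercase letter:

```python
def is_var(head):
    # Convention assumed for this sketch: variables are capitalized.
    return head[0].isupper()

def deepest_app_var(term, depth=0):
    """Return the depth of the deepest occurrence of an applied variable
    (a variable head with at least one argument), or -1 if there is none."""
    head, *args = term
    best = depth if (is_var(head) and args) else -1
    for arg in args:
        best = max(best, deepest_app_var(arg, depth + 1))
    return best

# C[X a]    ~ ("c", ("X", ("a",)))        -> applied variable X at depth 1
# C[f (X a)] ~ ("c", ("f", ("X", ("a",)))) -> X at depth 2
# PSAV prefers the first clause, PDAV the second.
```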

**Evaluation and Discussion.** In the base configuration, Zipperposition visits several clause queues, one of which uses the constant priority function ConstPrio (CP). To evaluate the new priority functions, we replaced the queue ordered by CP with a queue ordered by one of the new functions, leaving the clause weight intact. The results are shown in Figure 3. It shows that the expensive priority functions PHOS and BNF, which require inspecting the proofs of clauses, hardly help. Simple functions such as PL are more effective: Compared with base, PL loses one problem overall but proves 22 new problems.


Fig. 3: Effect of the priority function on performance


Fig. 4: Effect of the FluidSup weight increment Kincr on performance

FluidSup is disabled in base because it is so explosive. To test whether increasing inference stream weights affects the success rate, we enabled FluidSup and used different weight increments Kincr for FluidSup inference queues. The results are shown in Figure 4. As expected, using a low increment with FluidSup is detrimental to performance. However, as the column for Kincr = 16 shows, we should not use too high an increment either, since that delays useful FluidSup inferences. Interestingly, even though the configuration with Kincr = 1 proves the fewest problems overall, it proves 7 problems not proved by base, more than any other configuration we tried.

#### **7 Controlling the Use of Backends**

Cooperation with efficient first-order theorem provers is an essential feature of higher-order theorem provers such as Leo-III [40, Sect. 4.4] and Satallax [11]. Those provers invoke first-order backends repeatedly during a proof attempt and spend a substantial amount of time in backend collaboration. Since λ-superposition generalizes a highly efficient first-order calculus, we expect that future efficient λ-superposition implementations will not benefit much from backends. Experimental provers such as Zipperposition, however, can still gain a lot from them. We present some techniques for controlling the use of backends.

In his thesis [40, Sect. 6.1], Steen extensively evaluates the effects of using different first-order backends on the performance of Leo-III. His results suggest that adding even a single backend already substantially improves performance. To reduce the effort required for integrating multiple backends, we chose Ehoh [50] as our single backend. Ehoh is an extension of the highly optimized superposition prover E with support for higher-order features such as partial application, applied variables, and interpreted Booleans. On the one hand, Ehoh provides the efficiency of E while easing the translation from full higher-order logic: The only missing syntactic feature is λ-abstraction. On the other hand, Ehoh's higher-order reasoning capabilities are limited. Its unification algorithm is essentially first-order, and it cannot synthesize λ-abstractions.


Fig. 5: Effect of the backend invocation point Ktime


Fig. 6: Effect of the method used to translate λ-abstractions


Fig. 7: Effect of the number of selected clauses Ksize

In a departure from Leo-III and other cooperative provers, we invoke the backend at most once during a run of the prover. This is because most competitive higher-order provers use a portfolio mode in which many configurations are run for a short time, and we want to leave enough time for native higher-order reasoning. Moreover, multiple backend invocations tend to be wasteful, because currently each invocation starts with no knowledge of the previous ones.

Only a carefully chosen subset of the available clauses is translated and sent to Ehoh. Let I be the set of input clauses. Given a proof state, let M = P ∪ A, and let Mho denote the subset of M containing only clauses that were derived using at least one λ-superposition-specific inference rule. We order the clauses in Mho by increasing derivation depth, using syntactic weight to break ties. Then we choose all clauses in I and the first Ksize clauses from Mho for use with the backend reasoner. We leave out clauses in M \ (I ∪ Mho) because Ehoh can rederive them. We also expect large clauses with deep derivations to be less useful.
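The selection policy just described fits in a few lines. The clause records and field names below are hypothetical; `k_size` plays the role of Ksize:

```python
def select_for_backend(inputs, m_ho, k_size):
    """Send all input clauses plus the k_size 'best' clauses from Mho:
    lowest derivation depth first, syntactic weight as tie-break."""
    ranked = sorted(m_ho, key=lambda c: (c["depth"], c["weight"]))
    return list(inputs) + ranked[:k_size]
```

Clauses outside I ∪ Mho are deliberately not sent, since the backend can rederive them from the input clauses on its own.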

The remaining step is the translation of λ-abstractions. We support two translation methods: λ-lifting [24] and SKBCI combinators [48]. For SKBCI, we omit the combinator definition axioms, because they are very explosive [10]. A third mode simply omits clauses containing λ-abstractions.

**Evaluation and Discussion.** In Zipperposition, we can adjust the CPU time allotted to Ehoh, Ehoh's own proof search parameters, the point when Ehoh is invoked, the number Ksize of selected clauses from Mho, and the λ translation method. We fix the time limit to 5 s, use Ehoh in auto mode, and focus on the last three parameters. In base, collaboration with Ehoh is disabled.

Ehoh is invoked after Ktime · t CPU seconds, where 0 ≤ Ktime < 1 and t is the total CPU time allotted to Zipperposition. Figure 5 shows the effect of varying Ktime when Ksize = 32 and λ-lifting is used. The evaluation confirms that using a highly optimized backend such as Ehoh greatly improves the performance of a less optimized prover such as Zipperposition. The figure also indicates that it is preferable to invoke the backend early. We have indeed observed that if the backend is invoked late, small clauses with deep derivations tend to be present by then. These clauses might already have been used to delete important shallow clauses, but due to their derivation depth they will not be translated. In such situations, it is better to invoke the backend before the important clauses are deleted.


Fig. 8: Comparison of competing higher-order theorem provers

Figure 6 quantifies the effects of the three λ-abstraction translation methods. We fixed Ktime = 0.25 and Ksize = 32. The clear winner is λ-lifting. Omitting clauses with λ-abstractions performs comparably to SKBCI combinators.

Figure 7 shows the effect of Ksize on performance, with Ktime = 0.25 and λ-lifting. We find that including a small number of higher-order clauses with the lowest weight performs better than including a large number of such clauses.

#### **8 Comparison with Other Provers**

Different choices of parameters lead to noticeably different sets of proved problems. In an attempt to use Zipperposition 2 to its full potential, we have created a portfolio mode that runs up to 50 configurations in parallel during the allotted time. To provide some context, we compare Zipperposition 2 with the latest versions of all higher-order provers that competed at CASC-J10: CVC4 1.8 [4], Leo-III 1.5 [42], Satallax 3.5 [11], and Vampire 4.5 [10]. Note that Vampire's higher-order schedule is optimized for running on a single core.

We use the same 2606 monomorphic higher-order TPTP 7.2.0 problems as elsewhere in this paper, but we try to replicate the CASC setup more faithfully. CASC-J10 was run on 8-core CPUs with a 120 s wall-clock limit and a 960 s CPU limit. Since we run the experiments on 4-core CPUs, we set the wall-clock limit to 240 s and keep the same CPU limit. Leo-III, Satallax, and Zipperposition are cooperative provers. We also run them in uncooperative mode, without their backends, to measure their intrinsic strength. Figure 8 summarizes the results.

Among the cooperative provers, Zipperposition is the one that depends the least on its backend, and its uncooperative mode is only one problem behind Satallax's cooperative mode. This confirms our hypothesis that λ-superposition is a suitable basis for automatic higher-order reasoning. It also suggests that implementing this calculus in a modern first-order superposition prover such as E or Vampire would achieve markedly better results. Moreover, we believe that there are still techniques inspired by tableaux, SAT solving, and SMT solving that could be adapted and integrated into saturation provers.

#### **9 Discussion and Conclusion**

Back in 1994, Kohlhase [27, Sect. 1.3] was optimistic about the future of higher-order automated reasoning:

The obstacles to proof search intrinsic to higher-order logic may well be compensated by the greater expressive power of higher-order logic and by the existence of shorter proofs. Thus higher-order automated theorem proving will be practically as feasible as first-order theorem proving is now as soon as the technological backlog is made up.

For higher-order superposition, the backlog consisted of designing calculus extensions, heuristics, and algorithms that mitigate its weaknesses. In this paper, we presented such enhancements, justified their design, and evaluated them. We explained how each weak point in the higher-order proving pipeline could be improved, from preprocessing to reasoning about formulas, to delaying unpromising or explosive inferences, to invoking a backend. Our evaluation indicates that higher-order superposition is now the state of the art in higher-order reasoning.

Higher-order extensions of first-order superposition have been considered by Bentkamp et al. [6, 7] and Bhayat and Reger [9, 10]. They introduced proof calculi, proved them refutationally complete, and suggested optional rules, but they hardly discussed the practical aspects of higher-order superposition. Extensions of SMT are discussed by Barbosa et al. [3]. Bachmair and Ganzinger [1], Manna and Waldinger [29], and Murray [31] have studied nonclausal resolution calculi.

In contrast, there is a vast literature on practical aspects of first-order reasoning using superposition and related calculi. The literature evaluates various procedures and techniques [21,36], literal and term order selection functions [20], and clause evaluation functions [19, 39], among others. Our work joins the select club of papers devoted to practical aspects of higher-order reasoning [8,16,41,53].

As a next step, we plan to implement the described techniques in Ehoh [50], the λ-free higher-order extension of E. We expect the resulting prover to be substantially more efficient than Zipperposition. Moreover, we want to investigate the proofs found by provers such as CVC4 and Satallax but missed by Zipperposition. Finding the reasons why Zipperposition fails to prove specific problems will likely lead to useful new techniques.

**Acknowledgment.** We are grateful to the maintainers of StarExec for letting us use their service. Ahmed Bhayat and Giles Reger guided us through details of Vampire 4.5. Ahmed Bhayat, Michael Färber, Mathias Fleury, Predrag Janičić, Mark Summerfield, and the anonymous reviewers suggested content, textual, and typesetting improvements. We thank them all.

Vukmirović, Bentkamp, and Blanchette's research has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 713999, Matryoshka). Blanchette and Nummelin's research has received funding from the Netherlands Organization for Scientific Research (NWO) under the Vidi program (project No. 016.Vidi.189.037, Lean Forward) and the Incidental Financial Support scheme.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Dual Proof Generation for Quantified Boolean Formulas with a BDD-based Solver

Randal E. Bryant and Marijn J. H. Heule

Computer Science Department Carnegie Mellon University, Pittsburgh, PA, United States {Randy.Bryant, mheule}@cs.cmu.edu

Abstract. Existing proof-generating quantified Boolean formula (QBF) solvers must construct a different type of proof depending on whether the formula is false (refutation) or true (satisfaction). We show that a QBF solver based on ordered binary decision diagrams (BDDs) can emit a single *dual proof* as it operates, supporting either outcome. This form consists of a sequence of equivalence-preserving clause addition and deletion steps in an extended resolution framework. For a false formula, the proof terminates with the empty clause, indicating conflict. For a true one, it terminates with all clauses deleted, indicating tautology. Both the length of the proof and the time required to check it are proportional to the total number of BDD operations performed. We evaluate our solver using a scalable benchmark based on a two-player tiling game.

## 1 Introduction

Adding quantifiers to Boolean formulas, yielding the logic of *quantified Boolean formulas* (QBFs), greatly extends their expressive power [11], but it presents several challenges, including verifying the output of a QBF solver. Unlike with a satisfiable Boolean formula, there is no satisfying assignment to certify a QBF: the formula is simply false or true. Instead, a proof-generating QBF solver must provide a full proof in either case: a *refutation* proof if the formula is false, or a *satisfaction* proof if the formula is true.

Currently, there is little standardization of the proof capabilities or the proof systems supported by different QBF solvers [21]. Some solvers can generate *syntactic* certificates—ones that can be directly checked by a proof checker. For a false formula, these can be expressed in clausal proof frameworks that augment resolution with rules for universal quantification [18]. For a true formula, several QBF solvers can generate term resolution proofs [12], effectively reasoning about a negated version of the input formula represented in disjunctive form. These require the proof checker to support an entirely different set of proof rules.

An even larger number of solvers can generate *semantic* certificates in the form of Herbrand functions for false formulas and Skolem functions for true ones, describing how to instantiate either the universal or the existential variables [21]. These can be used to expand the original formula into an (often much larger) Boolean formula that is checked with a SAT solver [22] or with a high-degree polynomial algorithm [25]. Performing the check often requires far more effort than does running the solver. These approaches, along with others involving syntactic certificates, require at least two passes: one to determine whether the formula is true or false, and one to generate the proof.

This paper describes a new approach to proof generation for QBF, where the solver generates a *dual proof*, serving as either a refutation or a satisfaction proof depending on whether the solver determines the formula to be false or true. A dual proof consists of a sequence of clause addition and deletion steps, each preserving equivalence to the original formula. If the proof terminates with the addition of the empty clause, then it demonstrates that the original formula was contradictory and therefore false. If the proof terminates with all clauses removed, then it demonstrates that the original formula was equivalent to a tautology and is therefore true. The proofs are expressed in a clausal proof framework that incorporates extended resolution, as well as rules for universal and existential quantification [13, 14].

We have implemented a QBF solver PGBDDQ based on ordered binary decision diagrams (BDDs) that can generate dual proofs as it operates. As optimizations, PGBDDQ can be directed to generate refutation or satisfaction proofs, and these can be somewhat shorter and take less time to check than dual proofs. Refutation proofs follow the traditional format of a series of truth-preserving steps leading to an empty clause. Satisfaction proofs follow the novel format of a series of falsehood-preserving steps leading to an empty set of clauses. This approach for satisfaction proofs has been previously used as part of a QBF preprocessor [13, 14], but, to the best of our knowledge, ours is the first use in a complete QBF solver. Whether dual, refutation, or satisfaction, the proofs generated by PGBDDQ have length proportional to the number of BDD operations and can readily be validated by a simple proof checker.

For the case of refutation proofs, PGBDDQ builds on the work of Jussila et al. [17], whose BDD-based QBF solver EBDDRES could generate refutation proofs in an extended resolution framework. Whereas their solver, as well as all other published BDD-based QBF solvers [23, 24], requires the BDD variable ordering to be the inverse of the quantification ordering, PGBDDQ allows independent choices for the two orderings. As will be shown, this can lead to an exponential advantage on some benchmarks.

We evaluate the performance of PGBDDQ using a scalable benchmark based on a two-player tiling game. We show that, with the right combination of Tseitin variable placement, BDD variable ordering, and elimination variable ordering, a BDD-based QBF solver can achieve performance that scales polynomially with the problem size. In these cases, PGBDDQ can readily outperform state-of-the-art search-based solvers, while having the added benefit that it generates a checkable proof.

## 2 Background Preliminaries

A *literal* l is either a variable y or its complement ¬y. We denote the underlying variable of literal l as Var(l), while ¬l denotes the complement of literal l.

A *clause* is a set of literals, representing the disjunction of a set of complemented and uncomplemented variables. The empty clause, indicating logical falsehood, is written ⊥. We consider only *proper* clauses, where a literal can occur only once in a clause, and a clause cannot contain both a variable and its complement. Logical truth, or tautology, is denoted ⊤ and represented by an empty set of clauses. For clarity, we write clauses as Boolean formulas, such as x ∧ y → z for the clause {¬x, ¬y, z}. As a special case, the unit clause consisting of literal l is simply written as l.

ITE: For Boolean values a, b, and c, the *ITE* operation (short for "If-Then-Else") is defined as *ITE*(a, b, c) = (a ∧ b) ∨ (¬a ∧ c). This can also be written as a conjunction of clauses: *ITE*(a, b, c) = (a → b) ∧ (¬a → c).

QBF: We consider quantified formulas in *prenex normal form* over a set of *input variables* X, with input formula ΦI having the form ΦI = Q1X1 Q2X2 ··· QmXm ψI. The *quantifier prefix* QI = Q1X1 Q2X2 ··· QmXm consists of a series of *quantifier blocks*. Each block j has an associated quantifier Qj ∈ {∀, ∃} and a set of variables Xj ⊆ X, such that the sets X1, X2, ..., Xm form a partitioning of X. The formula *matrix* ψI is given as a set of clauses referred to as the *input* clauses. An input variable x occurring in some partition Xj is said to be *universal* (respectively, *existential*) when Qj = ∀ (resp., Qj = ∃) and is said to be at *quantification level* j. The type and level of each literal l match those of its underlying variable Var(l).

Resolution: Let C and D be clauses, where C contains variable y and D contains its complement ¬y. We also require that there be no literal l ∈ C, with l ≠ y, such that ¬l ∈ D. The *resolvent* clause is then defined as Res(C, D) = (C ∪ D) − {y, ¬y}. When C and D do not satisfy these requirements, Res(C, D) is undefined. This definition does not allow the resolvent to be a tautology.

The resolution operation extends to linear chains and sets of clauses, as well. For a clause sequence C1, C2, ..., Ck, we define its resolvent as:

$$\operatorname{Res}(C\_1, C\_2, \dots, C\_k) = \operatorname{Res}(C\_1, \operatorname{Res}(C\_2, \dots, \operatorname{Res}(C\_{k-1}, C\_k) \cdots ))$$

The sequence C1, C2, ..., Ck is termed the *antecedent*. Again, the operation is undefined if any individual application of the operation is undefined. For a set of clauses ψ, we define Res(ψ) as the set of all resolvents that can be generated from sequences composed of clauses from ψ, with each clause used at most once per sequence.

As a separate notation, for a set of clauses ψ, we let Resy(ψ) be the set of all defined resolvents Res(C, D) with C, D ∈ ψ, y ∈ C, and ¬y ∈ D.
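The resolution definitions above transcribe directly into code. In this sketch (our own encoding, not PGBDDQ's), literals are signed integers (¬y is −y) and clauses are frozensets:

```python
def resolve(c, d, y):
    """Res(C, D) on variable y: defined only if y ∈ C, ¬y ∈ D, and the
    result would not be a tautology; returns None when undefined."""
    if y not in c or -y not in d:
        return None
    rest_c, rest_d = c - {y}, d - {-y}
    if any(-l in rest_d for l in rest_c):   # complementary pair -> undefined
        return None
    return rest_c | rest_d

def res_y(psi, y):
    """Res_y(ψ): all defined resolvents on variable y between clauses of ψ."""
    return {r for c in psi for d in psi
            if (r := resolve(c, d, y)) is not None}
```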

Extension: Extended resolution [28] allows the introduction of *extension variables* to serve as a shorthand notation for other formulas. Generalizing extended resolution to quantified formulas requires additional considerations regarding 1) the distinction between existentially and universally quantified variables, and 2) the position of the extension variables within the quantification ordering. In particular, as extension variables are generated, they must be classified as existential and be inserted into intermediate positions in the ordering [3, 17]. To support this capability, we associate a *quantification level* λ(y) with each input and extension variable y. For input variable x, where x ∈ Xj, we define λ(x) = 2j − 1. Input variables will therefore have odd values for λ. Each extension variable e will be assigned an even value for λ(e) according to rules defined below. For literal l, we define λ(l) = λ(Var(l)).

As clauses are added and deleted, and as extension variables are introduced, a formula will be maintained with an overall form

$$\Phi = Q\_1 X\_1 \exists E\_1 \, Q\_2 X\_2 \exists E\_2 \, \cdots \, Q\_m X\_m \exists E\_m \, \psi \tag{1}$$

where E1, E2, ..., Em is a partitioning of the set of extension variables. The quantifier prefix Q in (1) is therefore an alternation of input and extension variables, with all extension variables being existentially quantified. We can also view the quantifier prefix as simply a set of variables y, ordered by the values of λ(y), where y is universal when λ(y) = 2j − 1 with Qj = ∀, and existential otherwise. We use set notation when referring to the quantifier prefix, recognizing that the partitioning of variables into quantifier blocks and the associated quantifier types are defined implicitly by the function λ.

Two quantifier prefixes Q and Q′, each with m input variable blocks, are said to be *compatible* when Qj = Q′j for 1 ≤ j ≤ m, and λ(y) = λ′(y) for all y ∈ Q ∩ Q′, where the unprimed and primed symbols correspond to Q and Q′, respectively.

Extension introduces existential variable e by adding a set of *defining clauses* θ to the matrix and adding e to the quantifier prefix. Consider QBF Φ = Q ψ. Let e be a fresh variable (i.e., e ∉ Q) and let θ be a set of clauses that are *blocked* on e [5]. That is, each clause in θ must contain either e or ¬e, and for any clauses C, D ∈ θ for which e ∈ C and ¬e ∈ D, there must be some other literal l ∈ C such that ¬l ∈ D, and therefore Rese(θ) = ∅. Define Φ′ = Q′ ψ′ as follows. Variable e is assigned quantification level λ(e) = max{Even(λ(y)) | y ∈ Var(θ), y ≠ e}, where Var(θ) is defined to be the set of all variables occurring in the clauses in θ. Function Even rounds a number up to the next even value, i.e., Even(a) = 2⌈a/2⌉. This definition guarantees that λ(e) is even and that every variable y occurring in θ has λ(y) ≤ λ(e). Letting Q′ = Q ∪ {e} and ψ′ = ψ ∪ θ, it can be shown that Φ′ is true if and only if Φ is true [17].

Boolean Functions: The *restriction* of Boolean function f with respect to variable x, denoted f|x, is defined as the function that results when variable x is assigned value 1. Similarly, f|¬x is defined as the function that results when x is assigned value 0.
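The level bookkeeping is easy to state in code. This is a small illustrative sketch; the `levels` table and variable names are our own:

```python
def even(a):
    """Round up to the next even value: Even(a) = 2*ceil(a/2)."""
    return 2 * ((a + 1) // 2)

def level_of_extension(defining_vars, levels):
    """λ(e) = max over y in Var(θ), y != e, of Even(λ(y))."""
    return max(even(levels[y]) for y in defining_vars)

# Input variables in blocks 1 and 2 get odd levels 1 and 3;
# an extension variable over both gets even level 4.
levels = {"x1": 1, "x2": 3}
levels["e"] = level_of_extension({"x1", "x2"}, levels)
```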

The *Shannon expansion* relates a Boolean function to its restrictions with respect to a variable and its complement. For a function f and variable x:

$$\begin{aligned} f &= \mathit{ITE}(x, f|\_x, f|\_{\overline{x}}) \\ &= \left(x \to f|\_x\right) \land \left(\overline{x} \to f|\_{\overline{x}}\right) \end{aligned} \tag{2}$$

We will find clausal form (2) to be of use in generating satisfaction proofs.

For Boolean function f and variable x we can define the existential and universal quantifications of f with respect to x as projection operations that eliminate the dependency on x through either disjunction or conjunction:

$$\exists x \, f = f|\_{x} \lor f|\_{\overline{x}} \tag{3}$$

$$\forall x \, f = f|\_x \land f|\_{\overline{x}} \tag{4}$$
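Restriction and the two quantification operations (2)-(4) can be modeled directly, here with Boolean functions represented as Python callables on bit tuples (a semantic stand-in, not how a BDD package stores functions):

```python
def restrict(f, i, b):
    """f|_x (b=1) or f with x assigned 0 (b=0), for the variable at index i."""
    return lambda bits: f(bits[:i] + (b,) + bits[i + 1:])

def exists(f, i):
    # (3): disjunction of the two restrictions
    return lambda bits: restrict(f, i, 1)(bits) or restrict(f, i, 0)(bits)

def forall(f, i):
    # (4): conjunction of the two restrictions
    return lambda bits: restrict(f, i, 1)(bits) and restrict(f, i, 0)(bits)

f = lambda bits: bits[0] and not bits[1]   # f = x0 ∧ ¬x1
g = exists(f, 1)                           # ∃x1 f, semantically equal to x0
h = forall(f, 1)                           # ∀x1 f, semantically false
```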

BDDs: A reduced, ordered binary decision diagram (BDD) provides a canonical form for representing a set of Boolean functions, and an associated set of algorithms for constructing them and testing their properties [1, 7, 8]. A set of functions is represented as a directed acyclic graph, with each function indicated by a pointer to its root node. We will therefore use the symbol u to refer at times to 1) a node in the BDD, 2) the subgraph of the BDD having u as its root, 3) the function represented by this subgraph, and 4) an extension variable associated with the node.

The ordered BDD representation requires defining a total ordering of the variables. Unlike other BDD-based QBF solvers [17, 23, 24], PGBDDQ allows this ordering to be independent of the ordering of variables in the quantifier prefix. The two leaf nodes are denoted L0 and L1, representing the constant functions 0 and 1, respectively. Each nonterminal node u has an associated variable and two children indicating branches for the two possible values of the variable.

BDD packages support multiple operations for constructing and testing the properties of Boolean functions represented by a BDD. A number of these are based on the *Apply* algorithm [6]. Given root nodes u and v representing functions f and g, respectively, and a Boolean operation (e.g., AND), the algorithm generates a root node w representing the result of applying the operation to those functions (e.g., f ∧ g). It operates by traversing its arguments via a series of recursive calls, using a table to cache previously computed results. Variants of the Apply algorithm can also perform restriction and quantification.

QBF Solving with a BDD: With the ability to perform disjunction, conjunction, and quantification of Boolean functions, there is a straightforward algorithm for solving a QBF with a BDD. It starts by computing a representation of the formula matrix, using the Apply algorithm with operation ∨ for each clause and conjoining the results using the Apply algorithm with operation ∧. Then, quantifiers are eliminated, starting from the innermost quantifier block Xm and working outward, using either universal or existential quantifier operations. At the end, the BDD will be reduced to either L0, indicating that the formula is false, or L1, indicating that the formula is true. This basic algorithm can be improved by deferring some of the conjunctions and by carefully selecting the order of quantification within each quantifier block [23, 24].
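The basic algorithm can be sketched end to end using a semantic stand-in for BDDs: each function is the set of its satisfying assignments, so Apply-style conjunction and disjunction become set operations. All names here are ours:

```python
from functools import reduce
from itertools import product

def all_assignments(n):
    return set(product((0, 1), repeat=n))

def clause_fn(clause, n):
    """Assignments satisfying a clause of signed integer literals
    (variable i is literal i, its complement is -i)."""
    return {a for a in all_assignments(n)
            if any(a[abs(l) - 1] == (1 if l > 0 else 0) for l in clause)}

def quantify(fn, var, kind, n):
    """Eliminate one variable by existential or universal projection."""
    i = var - 1
    keep = set()
    for a in all_assignments(n):
        hi = a[:i] + (1,) + a[i + 1:] in fn
        lo = a[:i] + (0,) + a[i + 1:] in fn
        if (hi or lo) if kind == "exists" else (hi and lo):
            keep.add(a)
    return keep

def solve_qbf(prefix, clauses, n):
    """prefix: outermost-first list of (kind, [vars]). Build the matrix,
    then eliminate quantifier blocks from the innermost outward."""
    fn = reduce(set.intersection, (clause_fn(c, n) for c in clauses),
                all_assignments(n))
    for kind, block in reversed(prefix):
        for v in block:
            fn = quantify(fn, v, kind, n)
    return fn == all_assignments(n)
```

For example, ∀x1 ∃x2 (x1 ↔ x2) is true, while ∃x2 ∀x1 (x1 ↔ x2) is false, and both answers fall out of the same matrix.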

## 3 Logical Foundations

A *clausal proof* consists of a sequence of steps starting with the clauses in the input formula ΦI. Each step either adds a set of clauses, and possibly an extension variable, or removes a set of clauses. These additions and removals define a sequence of QBFs Φ1, Φ2, ..., Φt, with Φ1 = ΦI and each Φi of the form Qi ψi.

For a refutation proof, each step i must preserve truth, i.e., Φi → Φi+1, and the proof must end with ⊥ ∈ ψt. This construction serves as a proof that ΦI = Φ1 → Φ2 → ··· → Φt = ⊥, and therefore the input formula is false. A satisfaction proof follows the same general format, except that it requires each step i to preserve falsehood, Φi+1 → Φi, and it reaches a final result with ψt = ∅. This construction serves as a proof that ⊤ = Φt → Φt−1 → ··· → Φ1 = ΦI, and therefore the input formula is true. A *dual* proof requires that each step preserve equivalence, Φi ↔ Φi+1, i.e., it is both truth and falsehood preserving. Only the final step, with ψt ∈ {⊥, ⊤}, determines whether it is a refutation or a satisfaction proof.

#### 3.1 Inference Rules

Table 1 shows the equivalence-preserving inference rules we use in our proofs. These are based on *redundant clauses*: cases where there are two sets of clauses ψ and θ such that Q ψ ↔ Q′ (ψ ∪ θ), for compatible prefixes Q and Q′. Thus, adding clauses θ to the matrix ψ defines an equivalence-preserving addition rule, while deleting them from the matrix ψ ∪ θ defines an equivalence-preserving removal rule.


Table 1. Inference rules where clause set θ is redundant with respect to the clauses in ψ.

We have already described resolution in Section 2. Universal reduction (also known as "forall reduction" [4,17]) is the standard rule for eliminating universal variables in a QBF refutation proof [18].

The extension rule forms the basis for adding extension variable y = e and its defining clauses θ. For this case, the clauses in θ are blocked with respect to y, and therefore Res<sub>y</sub>(θ) = ∅. As a deletion rule, the existential elimination rule is used to remove extension variables and their defining clauses, as well as to remove the existential input variables. It is a generalization of *blocked clause elimination* [5] in that the clauses in θ need not be blocked, as long as ψ contains all of the resolvents with respect to variable y. The redundancies used by the resolution, extension, and existential elimination rules are special cases of the quantified resolution asymmetric tautology (QRAT) property [13, 14].

#### 3.2 Integrating Proof Generation into BDD Operations

As described in [16, 17, 26] and [9], we use a BDD to represent Boolean functions defined by applying Boolean operations to the input variables X. When creating node u, we introduce an extension variable, also referred to as u, with up to four defining clauses. For a node u with variable x and children nodes u<sub>1</sub> and u<sub>0</sub>, these clauses encode the formula u ↔ *ITE*(x, u<sub>1</sub>, u<sub>0</sub>). As described in Section 2, we will have λ(u) = max{λ(x)+1, λ(u<sub>1</sub>), λ(u<sub>0</sub>)}.
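For concreteness, one standard clausal encoding of u ↔ *ITE*(x, u<sub>1</sub>, u<sub>0</sub>), in DIMACS-style integer literals, can be sketched as follows; the function names are illustrative, not taken from the PGBDDQ source.

```python
def defining_clauses(u, x, u1, u0):
    """The four clauses encoding u <-> ITE(x, u1, u0), where u is the
    extension variable for a node with branch variable x and children
    u1 and u0.  A negative integer denotes a negated variable."""
    return [[-u, -x, u1],   # u and x   imply u1
            [-u,  x, u0],   # u and ~x  imply u0
            [ u, -x, -u1],  # x and u1  imply u
            [ u,  x, -u0]]  # ~x and u0 imply u

def lam(lam_x, lam_u1, lam_u0):
    """Quantification level of the new node:
    lam(u) = max(lam(x) + 1, lam(u1), lam(u0))."""
    return max(lam_x + 1, lam_u1, lam_u0)
```

When a child is a leaf, substituting ⊥ or ⊤ for u<sub>1</sub> or u<sub>0</sub> simplifies some of these clauses and turns others into tautologies, matching the simplification described below.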

As in [9], we associate leaf nodes L<sub>0</sub> and L<sub>1</sub> directly with the logical values ⊥ and ⊤. When constructing node u, if either u<sub>1</sub> or u<sub>0</sub> is a leaf node, the defining clauses may be simplified, and some may degenerate to tautologies. By defining λ(⊥) = λ(⊤) = 0, we can still use the above formula to define the value of λ(u), such that λ(u<sub>1</sub>) ≤ λ(u), λ(u<sub>0</sub>) ≤ λ(u), and λ(x) < λ(u). This guarantees that the value of λ(u) is greater than or equal to that of any node or variable occurring in the subgraph with root u.

For a node u, define its *support set* S(u) as the set of variables occurring at some node in the subgraph with root u. Based on our construction, any node u will have λ(u) = 2j if and only if there is some x for which x ∈ X<sub>j</sub> ∩ S(u), and this property does not hold for any j′ > j.

As a final notation, let θ(u) denote the set consisting of the defining clauses for all nodes in the subgraph with root u.

The BDD package implements the set of operations shown in Table 2. Each generates a result node w, and it also generates sets of clauses forming extended resolution proofs of some properties relating the result to the arguments. As shown, some of these properties are truth preserving, while others are falsehood preserving. In each of these, C indicates a clause, u, v, and w are BDD nodes (or their associated extension variables), and l is a literal of an input variable.


Table 2. Required BDD Operations. Each generates a root node plus a set of proofs.

These operations serve the following roles:


#### 4 Integrating Proof Generation into a QBF Solver

PGBDDQ solves a QBF by maintaining a set T of root nodes, which we refer to as "terms." Each term is the result of conjuncting and applying elimination operations to some subset of the input clauses. T initially contains the root nodes for the BDD representations of the input clauses. The solver repeatedly removes one or two terms from T, performs a quantification or conjunction operation, and adds the result to T, except that terms with value L<sub>1</sub> are not added. Quantifiers are eliminated in reverse order, starting with block X<sub>m</sub> and continuing through X<sub>1</sub>. The process continues until either some generated term is the leaf value L<sub>0</sub>, indicating that the formula is false, or the set becomes empty, indicating that the formula is true. The solver simultaneously generates proof steps, including ones that add a unit clause u for each node u ∈ T.

Our presentation describes the general requirements for applying conjunction and elimination operations. These operations can be used to implement the basic method described in Section 2, as well as more sophisticated strategies that defer conjunctions until they are required before performing some of the elimination operations [23, 24].

Universal quantification commutes with conjunction and so can be applied to the terms independently. Applying existential quantification, on the other hand, requires performing conjunction operations until the variables to be quantified occur only in a single term.

#### 4.1 Dual Proof Generation

For both technical and implementation reasons, which we explain below, we require the input formula to have only a single variable in each quantifier block. This restriction can be satisfied by rewriting an arbitrary QBF, such that a quantifier block with k variables is *serialized*, splitting it into a sequence of k distinct quantification levels.
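Serialization itself is a simple prefix transformation. Assuming a prefix represented as a list of (quantifier, variable-list) pairs, a minimal sketch is:

```python
def serialize(prefix):
    """Split every quantifier block into single-variable blocks,
    preserving quantifier type and variable order, so that a block
    with k variables becomes k distinct quantification levels."""
    return [(q, [v]) for q, block in prefix for v in block]
```

For example, `serialize([('a', [1, 2]), ('e', [3])])` yields `[('a', [1]), ('a', [2]), ('e', [3])]`.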

When generating a dual proof, the solver generates steps proving that each update to the set of terms T preserves equivalence with the input formula. More formally, consider a matrix ψ containing the following clauses: 1) a unit clause u for each u ∈ T, plus 2) all of the defining clauses θ(u) for the subgraph rooted by each node u ∈ T. Let Q′ be the compatible quantifier prefix formed by augmenting input prefix Q<sub>I</sub> with the extension variables associated with the nodes in these subgraphs. Then each update preserves the invariant that Q<sub>I</sub> ψ<sub>I</sub> ↔ Q′ ψ. Furthermore, the solver takes care to systematically delete clauses once they are no longer needed, using the removal rules listed in Table 1. That enables it to finish with an empty set of clauses in the event the formula is true. The initial set of terms T consists of a root node u for each input clause C, and the solver uses the proof that C, θ(u) ⊢ u to justify adding unit clause u to the proof. It then uses this unit clause, plus the proof that u, θ(u) ⊢ C, to justify deleting input clause C.

Each step proceeds by generating new terms and by adding and removing clauses in the proof. Suppose the step involves computing results with root nodes w<sub>1</sub>, ..., w<sub>n</sub> based on argument terms u<sub>1</sub>, ..., u<sub>k</sub>. If any of the result nodes is the BDD leaf L<sub>0</sub>, then the formula is false. The solver can use truth-preserving rules generated by the BDD operations to justify adding an empty clause. Otherwise, the solver removes the argument terms from T and adds the result nodes, except for any equal to the BDD leaf L<sub>1</sub>. The solver uses the existing unit clauses plus the truth-preserving rules to justify adding unit clauses for each newly added term. It then uses the falsehood-preserving rules and the newly added unit clauses to justify deleting the unit clauses associated with the argument terms. It must also explicitly generate rules to remove some intermediate clauses that are added during these proof constructions. Other clauses, including the defining clauses for the BDD nodes and the clauses added during the BDD operations, get removed by a separate process described in Section 4.2. The net effect of each step, then, is to replace the argument terms in T by the non-constant result terms, maintaining a unit clause for each term in T as part of the proof.

Conjunction operations. For u, v ∈ T, the solver computes w = APPLYAND(u, v). For the case where w = L<sub>0</sub>, the generated truth-preserving proof will be the clause ¬u ∨ ¬v, which resolves with unit clauses u and v to generate the empty clause: the solver has proved that the formula is false.

Otherwise, the solver sets T to be (T − {u, v}) ∪ {w}. The proof for adding unit clause w follows by resolving the unit clauses u and v with the generated clause ¬u ∨ ¬v ∨ w (i.e., u ∧ v → w). The generated clauses w → u and w → v each resolve with unit clause w to justify deleting unit clauses u and v.
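The resolution steps that derive unit clause w can be replayed with a small helper; the integer encoding (u = 1, v = 2, w = 3) is hypothetical and chosen only for illustration.

```python
def resolve(c1, c2, pivot):
    """Resolve two clauses on `pivot` (pivot occurs in c1, -pivot in c2).
    Clauses are lists of DIMACS-style signed integer literals."""
    assert pivot in c1 and -pivot in c2
    return sorted((set(c1) - {pivot}) | (set(c2) - {-pivot}))

# Extension variables for the terms: u = 1, v = 2, result w = 3.
# The BDD operation supplies the clause u & v -> w, i.e. [-1, -2, 3].
step = resolve([1], [-1, -2, 3], 1)   # resolve away unit clause u
unit_w = resolve([2], step, 2)        # resolve away unit clause v, leaving [w]
```

The two resolutions eliminate u and v in turn, leaving exactly the unit clause for w that the step adds to the proof.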

Universal elimination operation. This operation is performed when Q<sub>j</sub> = ∀, and by our restriction, we must have X<sub>j</sub> = {x} for some universal variable x. We also require that the input variables for blocks X<sub>j′</sub> such that j′ > j have already been eliminated.

Since universal quantification commutes with conjunction, the solver can quantify each term individually and let subsequent conjunction operations perform the conjunction indicated in (4). That is, for each u ∈ T such that x ∈ S(u), operation RESTRICT is used to compute the two restrictions w<sub>x</sub> = u|<sub>x</sub> and w<sub>¬x</sub> = u|<sub>¬x</sub>. These will generate proofs of two downward implications, l ∧ u → w<sub>l</sub> for l ∈ {x, ¬x}, as well as proofs of two upward implications, l ∧ w<sub>l</sub> → u.

If w<sub>l</sub> equals leaf node L<sub>0</sub> for either l = x or l = ¬x, then the corresponding downward implication will be a clause of the form l ∧ u → ⊥, i.e., ¬l ∨ ¬u. Resolving this with the unit clause u and applying universal reduction generates the empty clause: the solver has proved that the formula is false.

Consider the general case, where neither w<sub>x</sub> nor w<sub>¬x</sub> is a leaf node. The solver sets T = (T ∪ {w<sub>x</sub>, w<sub>¬x</sub>}) − {u}. The downward implications l ∧ u → w<sub>l</sub> can be resolved with unit clause u to yield the clause l → w<sub>l</sub> for l ∈ {x, ¬x}. We can be certain that λ(w<sub>l</sub>) < λ(x) for both values of l, since x ∉ S(w<sub>l</sub>). Applying universal reduction to the two generated clauses then yields the unit clauses w<sub>x</sub> and w<sub>¬x</sub>. Resolving each unit clause w<sub>l</sub> with the upward implication l ∧ w<sub>l</sub> → u gives the clause l → u, for l ∈ {x, ¬x}. Resolving these with each other justifies deleting unit clause u. The intermediate clauses x → u, ¬x → u, x → w<sub>x</sub>, and ¬x → w<sub>¬x</sub> are removed by resolution deletion.

The case where one of the restrictions is the leaf node L<sub>1</sub> is handled similarly to the general case, except that this node is not added to T.

Our implementation applies the conjunction operation to terms w<sub>x</sub> and w<sub>¬x</sub> immediately after they are generated, to avoid causing the number of terms to expand by a factor of 2<sup>k</sup> when the formula contains a sequence of k universal quantifiers.

Existential elimination operations. This operation is performed when Q<sub>j</sub> = ∃. We can assume that X<sub>j</sub> = {x} for some existential variable x. We require that the input variables for blocks X<sub>j′</sub> such that j′ > j have already been eliminated. We also require the conjunction operations to have reduced T to contain at most one node u such that x ∈ S(u). The solver proceeds as follows to existentially quantify x from u, yielding a new term w and creating the justification for adding unit clause w. It also removes unit clause u, as well as some intermediate clauses. Note that w can equal L<sub>1</sub>, but not L<sub>0</sub>.


Overall Operation: For a false formula, the solver will terminate with the generation of leaf value L<sub>0</sub> during a conjunction or universal quantification operation. These cases will cause the proof to terminate with the addition of an empty clause. For a true formula, the solver will finish with T equal to the empty set, since it never adds a leaf node to T. A final clause removal operation with quantification level 0 then yields ψ<sub>t</sub> = ∅.

We can see now why we impose the restriction that any quantifier block X<sub>j</sub> with Q<sub>j</sub> = ∀ contain only one variable. Without it, the universal variable elimination operation may not be possible. Suppose X<sub>j</sub> = {x, x′}. Attempting to perform the universal quantification operation on variable x could yield a BDD node w<sub>l</sub>, with either l = x or l = ¬x, that depends on x′. That would require that λ(w<sub>l</sub>) > λ(x′) = λ(x), and so the universal reduction rule could not be applied. Serializing the universal blocks avoids this difficulty without limiting the generality of the solver.

#### 4.2 Clause Removal

As a dual proof proceeds, the BDD operations cause clauses to be added as extension variables are introduced and as inferences are made via resolution. Other clauses are added and removed explicitly by the proof steps, including the unit clauses for each term and the intermediate clauses generated by the steps. To support the case where the solver's outcome is true, the defining and resolution clauses must ultimately be removed, ending with an empty set of clauses. The solver must justify their removal, since clause deletion is not, in general, equivalence preserving.

Clause removal is triggered when performing existential quantification, just before applying the variable elimination rule with variable x to remove the clauses C<sub>x</sub> and C<sub>¬x</sub> (step 3). We must first ensure that there are no other clauses containing x or ¬x.

Our method is to remove any clause C containing a literal l for which λ(l) > λ(x) = 2j − 1. Clause removal can proceed by stepping through the clauses in the reverse order from how they were added. If a clause that was added by resolution contains a literal l with λ(l) ≥ 2j, it can be removed via resolution deletion, using the same antecedent as was used when it was added.

Suppose the solver encounters the defining clauses for a node u with λ(u) ≥ 2j. It can be certain that all clauses added by resolution that contain either u or ¬u have already been removed, since these must have followed the introduction of u in the clause ordering. Similarly, any parent node v of u must have already had its defining clauses removed, since the defining clauses for v must occur after those for u. The existential elimination rule can therefore be used to remove the defining clauses for u.

Working through the set of clauses in reverse order, the solver may encounter clauses added by resolution and defining clauses containing only literals l with λ(l) < 2j − 1. These need not be removed, and indeed they can prove useful (clauses added by resolution) or necessary (some defining clauses) for subsequent proof steps. They will be deleted by clause removal during later phases.

We can see now why we impose the restriction that any quantifier block X<sub>j</sub> with Q<sub>j</sub> = ∃ contain only one variable. It enables the use of the λ values to determine which clauses should be removed to eliminate any dependency on existential variable x. Serializing the existential quantifier blocks allows this scheme to work without limiting the generality of the solver.

#### 4.3 Specializing to Refutation or Satisfaction Proofs

Dual proofs have the advantage that they can be generated in a single pass, without knowing in advance whether the formula is true or false. On the other hand, they are, by necessity, somewhat longer and require more time to generate and to check. Another approach is to know (or guess) what the outcome will be and then direct the solver to generate a pure refutation or satisfaction proof. Specializing the proof generation to one of these forms is straightforward, and it can take advantage of more efficient ways to perform some of the quantifications.

A refutation proof need only justify that each step preserves truth. This enables several optimizations. First, deleting a clause always preserves truth, because it can only cause the set of satisfying solutions for the matrix to expand. Therefore clause deletion can be performed without any justification and can instead be incorporated into the BDD garbage collection process [9]. Second, the BDD package need not generate the falsehood-preserving proofs shown in Table 2, reducing the number of clauses generated. Finally, the existential quantification operation of (3) is inherently truth preserving. BDD packages can implement the quantification of a function by an entire set of variables via a variant of the Apply algorithm. If the quantification of root node u generates result node w, then the solver can run an implication test after the BDD computation has been performed to prove that u → w, as is done with our SAT solver [9]. This avoids the need to serialize existential quantifier blocks and to have the solver generate low-level proof steps for each existential variable.

Conversely, a satisfaction proof need only justify that each step preserves falsehood. Adding a clause always preserves falsehood, since it can only reduce the set of satisfying solutions for the matrix, and therefore clause addition can be performed without any justification. In addition, the BDD package need not generate the truth-preserving proofs shown in Table 2. Finally, universal quantification can be performed on an entire block of variables, producing node w from argument u. The solver can then run an implication test to generate a proof that w → u.

#### 5 Experimental Results

PGBDDQ<sup>1</sup> is written entirely in Python and consists of around 3350 lines of code, including a BDD package, support for generating extended-resolution proofs, and the overall QBF solver. By comparison, our proof-generating BDD-based SAT solver required around 2130 lines of code [9]. PGBDDQ can generate proofs in either the QRAT format [13, 14] or in a format we call QPROOF that supports just the proof rules given in Table 1. The latter format requires explicit lists of antecedents, and therefore each step can be checked without any search.

The overall control of PGBDDQ is based on a form of bucket elimination [10], where each quantifier block X<sub>j</sub> defines a bucket. It starts by generating BDD representations of the input clauses. The resulting terms are inserted into buckets according to the value of λ(u) for each root node u. As described in Section 3.2, this value will be 2j when u contains a variable from block X<sub>j</sub> in its support and has no variables at higher quantification levels.

Processing proceeds from the highest numbered bucket downward. For a universal level, quantification is performed for each bucket element individually with the results placed into buckets according to their values for λ. For an existential level, the elements are conjuncted and then existential quantification is performed. The result is placed into a bucket according to its value of λ.
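The bucket-driven control loop can be sketched in Python, again with truth tables packed into integers standing in for BDDs; this is a hypothetical simplification (a real BDD package makes the operations efficient, and no proof clauses are emitted here), and all names are illustrative.

```python
def bucket_solve(n, prefix, clauses):
    """Bucket elimination for a QBF over variables 1..n.  Each clause's
    function goes into the bucket of its highest quantification level;
    buckets are processed from the innermost block outward, quantifying
    universal terms individually and conjoining existential buckets."""
    full = (1 << (1 << n)) - 1
    level = {v: j for j, (_, blk) in enumerate(prefix, 1) for v in blk}

    def lit(l):
        t = sum(1 << i for i in range(1 << n) if (i >> (abs(l) - 1)) & 1)
        return t if l > 0 else full ^ t

    def cofactors(f, v):
        m = 1 << (v - 1)
        return (sum(((f >> (i | m)) & 1) << i for i in range(1 << n)),
                sum(((f >> (i & ~m)) & 1) << i for i in range(1 << n)))

    def bucket_of(f):  # highest quantification level f depends on
        deps = [level[v] for v in level
                if cofactors(f, v)[0] != cofactors(f, v)[1]]
        return max(deps, default=0)

    m = len(prefix)
    buckets = {j: [] for j in range(m + 1)}
    for clause in clauses:
        f = 0
        for l in clause:
            f |= lit(l)
        buckets[bucket_of(f)].append(f)
    for j in range(m, 0, -1):
        q, blk = prefix[j - 1]
        if q == 'a':                      # quantify each term independently
            results = []
            for f in buckets[j]:
                for v in blk:
                    pos, neg = cofactors(f, v)
                    f = pos & neg
                results.append(f)
        else:                             # conjoin the bucket, then quantify
            f = full
            for g in buckets[j]:
                f &= g
            for v in blk:
                pos, neg = cofactors(f, v)
                f = pos | neg
            results = [f]
        for f in results:
            if f == 0:
                return False              # leaf L_0: formula is false
            if f != full:                 # leaf L_1 terms are dropped
                buckets[bucket_of(f)].append(f)
    return True                           # all buckets drained without L_0
```

Note that a quantified result is re-bucketed by its own λ value rather than by its BDD root variable, mirroring the point made below about decoupling the BDD variable ordering from the quantifier ordering.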

We can see that this approach defers conjunction as long as possible, only operating on terms at some quantification level j that truly depend on one or more variables in X<sub>j</sub>. Similar techniques have been used in other BDD-based QBF solvers [23, 24]. However, other implementations place terms into buckets according to the BDD level of their root nodes, requiring the BDD variables to be ordered as the inverse of the quantification ordering. By labeling each node with its value of λ, we can determine the appropriate bucket from the root node without regard to the BDD variable ordering.

We have tested PGBDDQ on a number of scalable benchmark problems, finding it performs well in some cases, scaling polynomially, and poorly in others, scaling exponentially. Here we present results for a problem based on a two-player game. It provides insights into how polynomial scaling can be achieved, as well as the performance of the solver and two checkers.

Two-player games provide a rich set of benchmarks for QBF solvers, with each turn being translated into a quantification level. To encode the game from the perspective of the first player (Player A), A's turns are encoded with existential quantifiers, while the second player's (Player B) turns are encoded with universal quantifiers. The formula will be true if the game has a guaranteed winning strategy for A. The encoding of a game into QBF constrains the two players to only make legal moves. It also expresses the conditions under which A is the winner, namely that the game consist of t consecutive moves, for an odd value of t. Conversely, we can encode the formula where B has a winning strategy by reversing the quantifiers and expressing that the game must consist of an even number of consecutive moves. For a game where no draws are possible, these two formulas will be complementary.

<sup>1</sup> A demonstration version, complete with solver, checker, and benchmarks, is available at https://github.com/rebryant/pgbddq-artifact.

Consider a game played on a 1 × N grid of squares with a set of dominos, each of which can cover two squares. Players alternate turns, each placing a domino to cover two adjacent squares. The game completes when no more moves are possible, taking at most N/2 turns. The first player who cannot place a domino loses. This *linear domino placement* game is isomorphic to the object-removal game "Dawson's Kayles" [2]. It can be shown that player B has a winning strategy for N ∈ {0, 1, 15, 35}, as well as for all values of the form 34i + c where i ≥ 0 and c ∈ {5, 9, 21, 25, 29} [27].

The game is encoded as a QBF by introducing a set of N − 1 input variables for each possible move, each corresponding to the boundary between a pair of adjacent squares. A set of N − 1 Tseitin variables encodes the board state after each move, and sets of clauses enforce the conditions that 1) each move should cover exactly one boundary, and 2) neither that boundary nor the two adjacent ones should have been covered previously. In all, there are around N<sup>2</sup>/4 universal input variables, N<sup>2</sup>/4 existential input variables, and 3N<sup>2</sup>/2 Tseitin variables. The number of clauses grows as Θ(N<sup>3</sup>), due to the quadratic number of clauses needed to enforce the exactly-one constraints on the input variables for each move.
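The quadratic clause count per move comes from the standard pairwise exactly-one encoding, which for k variables emits one at-least-one clause plus k(k−1)/2 at-most-one clauses. A generic sketch (not the paper's exact encoder):

```python
from itertools import combinations

def exactly_one(vs):
    """Pairwise 'exactly one' CNF encoding over variables vs:
    one at-least-one clause plus a quadratic number of binary
    at-most-one clauses, in DIMACS-style signed literals."""
    return [list(vs)] + [[-a, -b] for a, b in combinations(vs, 2)]
```

With k = N − 1 move variables per turn and Θ(N) turns, this pairwise scheme accounts for the Θ(N<sup>3</sup>) clause growth noted above.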

To achieve polynomial performance, we found that several problem-specific techniques are required. First, the Tseitin variables for a given move are placed in an existential quantifier block immediately following the block for the input variables for that move. This is logically equivalent to the usual convention of placing all Tseitin variables in an innermost quantifier block, but it enables the bucket elimination algorithm to process the clauses for each move in sequence, rather than expanding the formulas in terms of only the input variables at the outset. Second, all variables are ordered for the BDD in "boundary-major" ordering. That is, all variables, including input and Tseitin variables, for the first boundary on the board come first, spanning the first quantification level to the last. The variables for the second boundary follow similarly, and so on for all N − 1 boundaries. This ordering has the effect that, when processing the clauses for some move, the variables encoding the next and previous states for a boundary, as well as the proposed change to its state, are localized within the ordering. Finally, when splitting a quantifier block into a series of single-variable blocks, we order them according to their BDD variable ordering. Since the solver eliminates variables in the reverse of their quantifier ordering, this convention causes the disjunction and conjunction operations of Equations (3) and (4) to be performed mainly on subgraphs of the BDD below the variables being quantified. This enables greater use of previously computed results via the operation cache.

Table 3 shows the performance of PGBDDQ, two checkers, and two other QBF solvers on the domino placement game as functions of N. It shows first cases where the encoded player has a winning strategy, and therefore the formula is true, and then cases where the encoded player's opponent has a winning strategy, and therefore the formula is false. Dual proofs were generated for both cases. For measurements with sufficient data points, we show the scaling trends, obtained by performing a linear regression on the logarithms of data generated for each value of N in increments of 5. All measurements were performed on a 4.2 GHz Intel Core i7 (I7-7700K) processor with 32 GB of memory running the MacOS operating system. Times are measured in elapsed seconds.


Table 3. Experimental Results for Dual Proof Generation with Linear Domino Placement Game. The first data series are for proofs of true formulas, and the second are for false formulas. Entries shown as "—" indicate cases where the program exceeded a 7200-second time limit.

As indicated in the column labeled "Input Clauses," the number of clauses grows as N<sup>2.7</sup>, not quite reaching the asymptotic value of N<sup>3</sup>. The number of proof clauses generated by PGBDDQ is nearly the same for both true and false formulas, with growth rates of N<sup>4.5</sup>. The times taken by the solver (labeled "Solve") and by our own checker ("Qproof") scale at about the same rate as the number of proof clauses.

We also benchmarked the QBF proof checker QRAT-TRIM [13, 14]. This program was already equipped to handle our forms of refutation and satisfaction proofs, and it can handle dual proofs without modification. The only concession to the idiosyncrasies of PGBDDQ was to serialize the universal quantifier blocks in the prefix of false formulas. This is required to enable application of the universal reduction rule. The existential blocks can stay intact, since our only reason to serialize these is to guide the clause removal process. Although the scaling of QRAT-TRIM is poor, it is encouraging that the solver can be verified by a checker that predates it by a number of years.

For comparison, we evaluated the performance of two other QBF solvers on this benchmark: DEPQBF, version 6.0 [20], and GHOSTQ [15,19]. We found they are both very fast for smaller values of N but then reach a narrow range of values for which they transition from running in just a few seconds to exceeding the timeout limit of 7200 seconds. For DEPQBF, this transition occurs as N ranges from 17 to 21, and for GHOSTQ, as N ranges from 21 to 26. PGBDDQ is much slower for small values of N, but it keeps scaling without hitting a sudden cutoff.

Although we did not run EBDDRES [17], we can use PGBDDQ to evaluate the impact of having the BDD variable ordering be the inverse of the quantifier ordering. Our experiments show that this ordering causes the runtime and proof sizes to scale exponentially in N. With N = 14 and B as the player, PGBDDQ runs for 4100 seconds to generate a refutation proof with 114,157,025 clauses. By contrast, a boundary-major ordering requires just 6 seconds and generates a proof with 309,387 clauses.


Table 4. Experimental Results for Specialized Proof Generation with Linear Domino Placement Game. The first data series are for satisfaction proofs, and the second are for refutation proofs.

Table 4 shows the advantage of generating specialized proofs when the formula is known in advance to be true or false. Comparing the columns labeled "Total Clauses" in Tables 3 and 4, we can see especially that refutation proofs are asymptotically shorter. These can take advantage of the more efficient approach to existential quantification in handling the large number of Tseitin variables. Again, the solution and checking time track the proof sizes. These optimizations allowed us to solve larger instances of the problem—up to N = 45 for true instances and N = 50 for false ones.

#### 6 Conclusions

We have demonstrated that a QBF solver can emit a single proof as it operates, leading to either an empty clause for a false formula or an empty set of clauses for a true one. Both the proof and the time required to check it scale as the number of BDD operations performed. Moreover, a BDD-based QBF solver can allow the choice of BDD variable ordering to be made independently from the quantifier ordering. This feature can be critical to obtaining performance that scales polynomially with the problem size.

Our prototype is only a start in implementing a fully automated QBF solver. Such a solver must be able to choose a BDD variable ordering based on the input formula structure. It must also be able to identify and move Tseitin variables to earlier positions in the quantifier ordering, generating proof steps justifying that this transformation is equivalence preserving.

The underlying operation of PGBDDQ has potential applications beyond QBF solving. The program could stop the process described in Section 4.1 at any point and generate a QBF that is provably equivalent to the input formula. PGBDDQ could therefore be used as a preprocessor for other solvers, and for other applications that require reasoning about Boolean formulas with quantifiers.

Acknowledgements. The second author is supported by NSF grant CCF-2010951.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Reliable Reconstruction of Fine-grained Proofs in a Proof Assistant**

Hans-Jörg Schurr<sup>1</sup>, Mathias Fleury<sup>2,3</sup>, and Martin Desharnais<sup>4</sup>

<sup>1</sup> University of Lorraine, CNRS, Inria, and LORIA, Nancy, France
hans-jorg.schurr@inria.fr

<sup>2</sup> Johannes Kepler University Linz, Linz, Austria
mathias.fleury@jku.at

<sup>3</sup> Max Planck Institute für Informatik, Saarland Informatics Campus, Saarbrücken, Germany

<sup>4</sup> Universität der Bundeswehr München, München, Germany
martin.desharnais@unibw.de

**Abstract.** We present a fast and reliable reconstruction of proofs generated by the SMT solver veriT in Isabelle. The fine-grained proof format makes the reconstruction simple and efficient. For typical proof steps, such as arithmetic reasoning and skolemization, our reconstruction can avoid expensive search. By skipping proof steps that are irrelevant for Isabelle, the performance of proof checking is improved. Our method increases the success rate of Sledgehammer by halving the failure rate and reduces the checking time by 13%. We provide a detailed evaluation of the reconstruction time for each rule. The runtime is influenced by both simple rules that appear very often and common complex rules.

**Keywords:** automatic theorem provers · proof assistants · proof verification

## **1 Introduction**

Proof assistants are used in verification and formal mathematics to provide trustworthy, machine-checkable formal proofs of theorems. Proof automation reduces the burden of finding proofs and allows proof assistant users to focus on the core of their arguments instead of technical details. A successful approach implemented by "hammers," like Sledgehammer for Isabelle [15], is to heuristically select facts from the background theory; use an external automatic theorem prover, such as a satisfiability modulo theories (SMT) solver [12], to filter the facts needed to discharge the goal; and use the filtered facts to find a trusted proof.

Isabelle does not accept proofs that do not go through the assistant's inference kernel. Hence, Sledgehammer attempts to find the fastest internal method that can recreate the proof (preplay). This is often a call of the smt tactic, which runs an SMT solver, parses the proof, and reconstructs it through the kernel. This reconstruction allows the usage of external provers. The smt tactic was originally developed for the SMT solver Z3 [18, 34].

The SMT solver CVC4 [10] is one of the best solvers on Sledgehammer generated problems [14], but currently does not produce proofs for problems with quantifiers. To reconstruct its proofs, Sledgehammer mostly uses the smt tactic based on Z3. However, since CVC4 uses more elaborate quantifier instantiation techniques, many problems provable for CVC4 are unprovable for Z3. Therefore, Sledgehammer regularly fails to find a trusted proof and the user has to write the proofs manually. veriT [19] (Sect. 2) supports these techniques and we extend the smt tactic to reconstruct its proofs. With the new reconstruction (Sect. 3), more smt calls are successful. Hence, less manual labor is required from users.

The runtime of the smt method depends on the runtime of the reconstruction and the solver. To simplify the reconstruction, we do not treat veriT as a black box anymore, but extend it to produce more detailed proofs that are easier to reconstruct. We use detailed rules for simplifications with a combination of propositional, arithmetic, and quantifier reasoning. Similarly, we add additional information to avoid search, e.g., for linear arithmetic and for term normalization. Our reconstruction method uses the newly provided information, but it also has a step skipping mode that combines some steps (Sect. 4).

A very early prototype of the extension was used to validate the fine-grained proof format itself [7, Sect. 6.2, second paragraph]. We also published some details of the reconstruction method and the rules [25] before adapting veriT to ease reconstruction. Here, we focus on the new features.

We optimize the performance further by tuning the search performed by veriT. Multiple options influence the execution time of an SMT solver. To fine-tune veriT's search procedure, we select four different combinations of options, or strategies, by generating typical problems and selecting options with complementary performance on these problems. We extend Sledgehammer to compare these four selected strategies and suggest the fastest to the user. We then evaluate the reconstruction with Sledgehammer on a large benchmark set. Our new tactic halves the failure rate. We also study the time required to reconstruct each rule. Many simple rules occur often, showing the importance of step skipping (Sect. 5).

Finally, we discuss related work (Sect. 6). Compared to the prototype [25], the smt tactic is now thoroughly tested. We fixed all issues revealed during development and improved the performance of the reconstruction method. The work presented here is integrated into Isabelle version 2021; i.e., since this version, Sledgehammer can also suggest veriT without user interaction. To simplify future reconstruction efforts, we document the proof format and all rules used by veriT. The resulting reference manual is part of the veriT documentation [40].

## **2 veriT and Proofs**

The SMT solver veriT is an open source solver based on the CDCL(T) calculus. In proof-production mode, it supports the theories of uninterpreted functions with equality, linear real and integer arithmetic, and quantifiers. To support quantifiers, veriT uses quantifier instantiation and extensive preprocessing.

veriT's proof syntax is an extension of SMT-LIB [11], which uses S-expressions and prefix notation. The proofs are refutation proofs, i.e., proofs of ⊥. A proof is an indexed list of steps. Each step has a conclusion clause (cl ...) and is annotated with a rule, a list of premises, and some rule-dependent arguments. veriT distinguishes 90 rules [40]. Subproofs are the key feature of the proof format. They introduce an additional context. Contexts are used to reason about binders, e.g., for preprocessing steps like transformations under quantifiers.

The conclusions of rules with contexts are always equalities. The context models a substitution applied to the free variables of the term on the left-hand side of the equality. Consider the following proof fragment that renames the bound variable x to vr, as done during preprocessing:

```
(assume a0 (exists ((x A)) (f x)))
(anchor :step t3 :args ((:= x vr)))
(step t1 (cl (= x vr)) :rule refl)
(step t2 (cl (= (f x) (f vr))) :rule cong :premises (t1))
(step t3 (cl (= (exists ((x A)) (f x))
               (exists ((vr A)) (f vr)))) :rule bind)
```
The assume command repeats input assertions or states local assumptions. In this fragment the assumption a0 is not used. Subproofs start with the anchor command that introduces a context. Semantically, the context is a shorthand for a lambda abstraction over the free variable and an application of the substituted term. Here the context is x ↦ vr, and step t1 means (λx. x) vr = vr, which is proven by reflexivity (rule refl). Then congruence is applied (step t2) to prove that (λx. f x) vr = f vr, and step t3 concludes the renaming.

During proof search each module of veriT appends steps onto a list. Once the proof is completed, veriT performs some cleanup before printing the proof. First, a pruning phase removes branches of the proof not connected to the root ⊥. Second, a merge phase removes duplicated steps. The final pass prepares the data structures for the optional term sharing via name annotations.
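The pruning pass can be sketched as a reachability computation from the root step. The following is a minimal illustrative model; the `Step` type, field names, and example proof are ours, not veriT's internal data structures:

```python
# Sketch of the pruning phase: keep only the steps reachable from the
# root step (the one concluding the empty clause). The Step type and
# the example proof are illustrative, not veriT's internals.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    conclusion: str
    premises: list = field(default_factory=list)

def prune(steps, root_name):
    """Return the steps reachable from the root, in original order."""
    by_name = {s.name: s for s in steps}
    reachable, todo = set(), [root_name]
    while todo:
        name = todo.pop()
        if name not in reachable:
            reachable.add(name)
            todo.extend(by_name[name].premises)
    return [s for s in steps if s.name in reachable]

proof = [
    Step("t1", "(cl a)"),
    Step("t2", "(cl b)"),                       # dead branch, pruned
    Step("t3", "(cl (not a))"),
    Step("t4", "(cl)", premises=["t1", "t3"]),  # root: empty clause
]
print([s.name for s in prune(proof, "t4")])     # ['t1', 't3', 't4']
```

A merge phase would additionally deduplicate steps that share the same rule, premises, and conclusion.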

## **3 Overview of the veriT-Powered smt Tactic**

Isabelle is a generic proof assistant based on an intuitionistic logical framework, Pure, and is almost always used parameterized with an object logic. In this work we use only Isabelle/HOL, the parameterization of Isabelle with higher-order logic with rank-1 (top-level) polymorphism. Isabelle adheres to the LCF [26] tradition. Its kernel supports only a small number of inferences. Tactics are programs that prove a goal by using only the kernel for inferences. The LCF tradition also means that external tools, like SMT solvers, are not trusted.

Nevertheless, external tools are successfully used. They provide relevant facts or a detailed proof. The Sledgehammer tool implements the former and passes the filtered facts to trusted tactics during preplay. The smt tactic implements the latter approach. The provided proof is checked by Isabelle. We focus on the smt tactic, but we also extended Sledgehammer to suggest our new tactic.

The smt tactic translates the current goal to the SMT-LIB format [11], runs an SMT solver, parses the proof, and replays it through Isabelle's kernel. To choose the SMT solver, the user applies (smt (z3)) to use Z3 and (smt (verit)) to use veriT. We will refer to them as z-smt and v-smt. The proof formats of Z3 and veriT are so different that separate reconstruction modules are needed. The v-smt tactic thus performs four steps: translation of the goal, solving, proof parsing, and replay.


## **4 Tuning the Reconstruction**

To improve the speed of the reconstruction method, we create small and well-defined rules for preprocessing simplifications (Sect. 4.1). Previously, veriT implicitly normalized every step; e.g., repeated literals were immediately deleted. It now produces proofs for this transformation (Sect. 4.2). Finally, the linear-arithmetic steps contain coefficients which allow Isabelle to reconstruct the step without relying on its limited arithmetic automation (Sect. 4.3). On the Isabelle side, the reconstruction module selectively decodes the first-order encoding (Sect. 4.4). To improve the performance of the reconstruction, it skips some steps (Sect. 4.5).

#### **4.1 Preprocessing Rules**

During preprocessing SMT solvers perform simplifications on the operator level which are often akin to simple calculations; e.g., a × 0 × f(x) is replaced by 0.

To capture such simplifications, we create a list of 17 new rules: one rule per arithmetic operator, one to replace Boolean operators such as XOR with their definition, and one to replace n-ary operator applications with binary applications. This is a compromise: having one rule for every possible simplification would create a longer proof. Since preprocessing uses structural recursion, the implementation simply picks the right rule in each leaf case. The example above now produces a prod_simplify step with the conclusion a × 0 × f(x) = 0. Previously, a single step of the connect_equiv rule collected all those simplifications, and no list of the simplifications performed by this rule existed. The reconstruction relied on an experimentally created list of tactics to be fast enough.

On the Isabelle side, the reconstruction is fast, because we can direct the search instead of trying automated tactics that can also work on other parts of the formula. For example, the simplifier handles the numeral manipulations of the prod_simplify rule, and we restrict it to use only arithmetic lemmas.

Moreover, since we know the performed transformations, we can ignore some parts of the terms by generalizing, i.e., replacing them by constants [18]. Because generalized terms are smaller, the search is more directed and we are less likely to hit the search-depth limitation of Isabelle's auto tactic as before. Overall, the reconstruction is more robust and easier to debug.
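The structural recursion described above can be sketched as a bottom-up simplifier that applies one named rule per operator and logs a proof step for each rewrite. Only prod_simplify is a rule name from the paper; the term representation and the second rule name are made up for this sketch:

```python
# Sketch: a bottom-up preprocessing simplifier that picks one named
# rule per operator (Sect. 4.1). Terms are nested tuples such as
# ("*", "a", 0, ("f", "x")); this representation is illustrative.

def simplify(term, steps):
    """Simplify recursively, logging (rule, before, after) per rewrite."""
    if not isinstance(term, tuple):
        return term
    op, *args = term
    args = [simplify(a, steps) for a in args]
    new = (op, *args)
    if op == "*" and 0 in args:       # e.g.  a * 0 * f(x)  ->  0
        new = 0
        steps.append(("prod_simplify", (op, *args), new))
    elif op == "xor":                 # replace XOR by its definition
        a, b = args                   # (this rule name is made up)
        new = ("and", ("or", a, b), ("not", ("and", a, b)))
        steps.append(("bool_def", (op, *args), new))
    return new

steps = []
print(simplify(("*", "a", 0, ("f", "x")), steps))  # 0
print(steps[0][0])                                 # prod_simplify
```

Because each leaf case emits exactly one step, the resulting proof names the precise simplification performed, which is what lets the Isabelle side restrict its search.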

#### **4.2 Implicit Steps**

To simplify reconstruction, we avoid any implicit normal form of conclusions. For example, a rule concluding t ∨ P for any formula t can be used to prove P ∨ P. In such cases veriT automatically normalized the conclusion P ∨ P to P. Without a proof of the normalization, the reconstruction has to handle such cases.

We add new proof rules for the normalization and extend veriT to use them. Instead of keeping only the normalized step, both the original and the normalized step appear in the proof. For the example above, we have the step P ∨ P and the normalized P. To remove a double negation ¬¬t we introduce the tautology ¬¬¬t ∨ t and resolve it with the original clause. Our changes do not affect any other part of veriT. The solver now also prunes steps concluding ⊤.
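The tautology-and-resolution scheme can be sketched on the double-negation example. This is a toy clause representation; `neg` and `resolve` are our illustrative helpers, not veriT's:

```python
# Sketch: removing a double negation not(not t) from a clause by
# resolving with the tautology not(not(not t)) \/ t (Sect. 4.2).
# Literals are strings or ("not", lit) tuples; purely illustrative.

def neg(lit):
    """Syntactic complement: wrap the literal in one more negation."""
    return ("not", lit)

def resolve(c1, c2, pivot):
    """Resolve c1 (containing pivot) with c2 (containing neg(pivot))."""
    assert pivot in c1 and neg(pivot) in c2
    return sorted((set(c1) - {pivot}) | (set(c2) - {neg(pivot)}), key=str)

nn_t = ("not", ("not", "t"))
clause = ["P", nn_t]                # P \/ not(not t), to be normalized
taut = [neg(nn_t), "t"]             # the introduced tautology
print(resolve(clause, taut, nn_t))  # ['P', 't']
```

The point of the scheme is that the normalization becomes an ordinary resolution step, so the checker needs no special case for it.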

On the Isabelle side, the reconstruction becomes more regular, with fewer special cases, and is more reliable. The reconstruction method can directly reconstruct rules. To deal with the normalization, the reconstruction used to first generate the conclusion of the theorem and then run the simplifier to match the normalized conclusion. This approach could not deal with tautologies.

We also improve the proof reconstruction of quantifier instantiation steps. One of the instantiation schemes, conflicting instances [8, 36], only works on clausified terms. We introduce an explicit quantified-clausification rule qnt_cnf issued before instantiating. While this rule is not further detailed, knowing when clausification is needed improves reconstruction, because it avoids clausifying unconditionally. The clausification is also shared between instantiations of the same term.

#### **4.3 Arithmetic Reasoning**

We use a proof witness to handle linear arithmetic. When the propositional model is unsatisfiable in the theory of linear real arithmetic, the solver creates la_generic steps. The conclusion is a tautological clause of linear inequalities and equations, and the justification of the step is a list of coefficients such that the linear combination is a trivially contradictory inequality after simplification (e.g., 0 ≥ 1). Farkas' lemma guarantees the existence of such coefficients for reals. Most SMT solvers, including veriT, use the simplex method [21] to handle linear arithmetic. It calculates the coefficients during normal operation.

The real arithmetic solver also strengthens inequalities on integer variables before adding them to the simplex method. For example, if x is an integer, the inequality 2x < 3 becomes x ≤ 1. The corresponding justification is the rational coefficient 1/2. The reconstruction must replay this strengthening.

The complete linear arithmetic proof step for the clause 1 < x ∨ 2x < 3 looks like

```
(step t1 (cl (< 1 x) (< (* 2 x) 3))
    :rule la_generic :args (1 (div 1 2)))
```

The reconstruction of an la_generic step in Isabelle starts with the goal ⋁ᵢ cᵢ where each cᵢ is either an equality or an inequality. The reconstruction method first generalizes over the non-arithmetic parts. Then it transforms the lemma into the equivalent formulation ¬c₁ ⇒ ⋯ ⇒ ¬cₙ ⇒ ⊥ and removes all negations (e.g., by replacing ¬(a ≤ b) with a > b).

Next, the reconstruction method multiplies each equation by the corresponding coefficient. For example, for integers, the equation A < B, and the coefficient p/q (with p > 0 and q > 0), it strengthens the equation and multiplies by p to get

p × (A div q) + p × (if B mod q = 0 then 1 else 0) ≤ p × (B div q).

The if-then-else term (if B mod q = 0 then 1 else 0) corresponds to the strengthening. If B mod q = 0, the result is an equation of the form A′ + 1 ≤ B′, i.e., A′ < B′. No strengthening is required for the corresponding theorem over reals.

Finally, we can combine all the equations by summing them while being careful with the equalities that can appear. We simplify the resulting (in)equality using Isabelle's simplifier to derive ⊥.

To replay linear arithmetic steps, Isabelle can also use the tactic linarith, as is done for Z3 proofs. It searches for the coefficients necessary to verify the lemma. The reconstruction used it previously [25], but the tactic can only find integer coefficients and fails if strengthening is required. Now the rule comes with a mechanically checkable certificate.
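The arithmetic of the certificate check can be illustrated with exact rationals on the example clause above. This is a toy checker under our own conventions: the (a, b) literal encoding and the function name are ours, not veriT's or Isabelle's:

```python
# Sketch: checking the la_generic certificate for the clause
#   1 < x  \/  2x < 3   (x an integer), coefficients (1, 1/2).
# A negated literal (a, b) stands for a*x <= b; illustrative only.
from fractions import Fraction
from math import floor

def contradicts(literals, coeffs, integer_x=True):
    """Scale each bound by its coefficient, strengthen integer bounds,
    and sum up; a valid certificate yields 0 <= k with k < 0."""
    ax, k = Fraction(0), Fraction(0)
    for (a, b), c in zip(literals, coeffs):
        sa, sb = c * a, c * b                  # scale  a*x <= b  by c
        if integer_x and sa.denominator == 1:  # LHS is integer-valued,
            sb = Fraction(floor(sb))           # so the bound tightens
        ax, k = ax + sa, k + sb
    return ax == 0 and k < 0                   # summed to 0 <= k, k < 0

# Negated literals: not(1 < x) -> x <= 1; not(2x < 3) -> -2x <= -3.
lits = [(1, 1), (-2, -3)]
print(contradicts(lits, [Fraction(1), Fraction(1, 2)]))  # True
```

Here the coefficient 1/2 scales -2x ≤ -3 to -x ≤ -3/2, which the integer strengthening tightens to -x ≤ -2; summed with x ≤ 1 this gives the contradiction 0 ≤ -1.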

#### **4.4 Selective Decoding of the First-order Encoding**

Next, we consider an example of a rule that shows the interplay of the higher-order encoding and the reconstruction. To express function application, the encoding introduces the first-order function app and constants for encoded functions. The proof rule eq_congruent expresses congruence on a first-order function: (t₁ ≠ u₁) ∨ ... ∨ (tₙ ≠ uₙ) ∨ f(t₁, ..., tₙ) = f(u₁, ..., uₙ). With the encoding it can conclude f ≠ f′ ∨ x ≠ x′ ∨ app(f, x) = app(f′, x′). If the reconstruction unfolds the entire encoding, it builds the term f ≠ f′ ∨ x ≠ x′ ∨ f x = f′ x′. It then identifies the functions and the function arguments and uses rewriting to prove that if f = f′ and x = x′, then f x = f′ x′.

However, Isabelle β-reduces all terms implicitly, changing the term structure. Assume f := λx. x = a and f′ := λx. a = x. After unfolding all constructs that encode higher-order terms and after β-reduction, we get (λx. x = a) ≠ (λx. a = x) ∨ x ≠ x′ ∨ (x = a) = (a = x′). The reconstruction method cannot identify the functions and function arguments anymore.

Instead, the reconstruction method leaves the encoding, including app, folded. This eliminates the need for a special case to detect lambda functions. Such a case was used in the previous prototype, but the code was very involved and hard to test (such steps are rarely used).

#### **4.5 Skipping Steps**

The increased number of steps in the fine-grained proof format slows down reconstruction. For example, consider skolemization from ∃x. P x. The proof from Z3 uses one step. veriT uses eight steps: first renaming it to (∃x. P x) = (∃v. P v) (with a subproof of at least two steps), then concluding the renaming to get (∃v. P v) (two steps), then (∃v. P v) = P (εv. P v) (with a subproof of at least two steps), and finally P (εv. P v) (two steps).

To reduce the number of steps, our reconstruction skips two kinds of steps. First, it replaces every usage of the or rule by its only premise. Second, it skips the renaming of bound variables. The proof format treats ∀x. P x and ∀y. P y as two different terms and requires a detailed proof of the conversion. Isabelle, however, uses De Bruijn indices, so variable names are irrelevant. Hence, we replace steps of the form (∀x. P x) ⟺ (∀y. P y) by a single application of reflexivity. Since veriT canonizes all variable names, this eliminates many steps.
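The effect of De Bruijn indices can be illustrated by a small conversion under which α-equivalent terms become syntactically identical (a hypothetical term representation, not Isabelle's):

```python
# Sketch: with De Bruijn indices, forall x. P x and forall y. P y
# convert to the same term, so renaming steps reduce to reflexivity.
# The ("all", var, body) tuple representation is illustrative.

def debruijn(term, bound=()):
    """Replace bound variable names by their binder depth."""
    if isinstance(term, str):
        return bound.index(term) if term in bound else term
    op, *args = term
    if op == "all":                       # ("all", x, body) binds x
        x, body = args
        return ("all", debruijn(body, (x,) + bound))
    return (op, *(debruijn(a, bound) for a in args))

t1 = ("all", "x", ("P", "x"))             # forall x. P x
t2 = ("all", "y", ("P", "y"))             # forall y. P y
print(debruijn(t1) == debruijn(t2))       # True
```

Since both terms convert to the same nameless form, a renaming step needs no proof beyond reflexivity.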

We can also simplify the idiom "equiv_pos2; th_resolution". veriT generates it for each skolemization and variable renaming. Step skipping replaces it by a single step, which we replay using a specialized theorem.

On proofs with quantifiers, step skipping can remove more than half of the steps: only four steps remain in the skolemization example above (two of which are simply reflexivity). However, with step skipping the smt method is no longer an independent checker that confirms the validity of every single step in a proof.

## **5 Evaluation**

During development we routinely tested our proof reconstruction to find bugs. As a side effect, we produced SMT-LIB files corresponding to the calls. We measure the performance of veriT with various options on these files and select five different strategies (Sect. 5.1). We also evaluate the distribution of the tactics used by Sledgehammer for preplay (Sect. 5.2) and the impact of the rules (Sect. 5.3).

We performed the strategy selection on a computer with two Intel Xeon Gold 6130 CPUs (32 cores, 64 threads) and 192 GiB of RAM. We performed Isabelle experiments with Isabelle version 2021 on a computer with two AMD EPYC 7702 CPUs (128 cores, 256 threads) and 2 TiB of RAM.

#### **5.1 Strategies**

veriT exposes a wide range of options to fine-tune the proof search. In order to find good combinations of options (strategies), we generate problems with Sledgehammer and use them to fine-tune veriT's search behavior. Generating problems also makes it possible to test and debug our reconstruction.

We test the reconstruction by using Isabelle's Mirabelle tool. It reads theories and automatically runs Sledgehammer [14] on all proof steps. Sledgehammer calls various automatic provers (here the SMT solvers CVC4, veriT, and Z3 and the superposition prover E [38]) to filter facts and chooses the fastest tactic that can prove the goal. The tactic smt is used as a last resort.


**Table 1.** Options corresponding to the different veriT strategies

To generate problems for tuning veriT, we use the theories from HOL-Library (an extended standard library containing various developments) and from the formalizations of Green's theorem [2, 3], the Prime Number Theorem [23], and the KBO ordering [13]. We call Mirabelle with only veriT as a fact filter. This produces SMT-LIB files for representative problems Isabelle users want to solve and a series of calls to v-smt. For failing v-smt calls, three causes are possible: veriT does not find a proof, reconstruction times out, or reconstruction fails with an error. We solved all reconstruction failures in the test theories.

To find good strategies, we determine which problems are solved by several combinations of options within a two-second timeout. We then choose the strategy that solves the most benchmarks and three strategies that together solve the most benchmarks. For comparison, we also keep the default strategy.

The strategies are shown in Table 1 and mostly differ in the instantiation schemes. The strategy del_insts uses instance deletion [6] and a breadth-first algorithm to find conflicting instances. All other strategies rely on extended trigger inference [29]. The strategy ccfv_SIG uses a different indexing method for instantiation. It also restricts enumerative instantiation [35], because the options --index-sorts and --index-fresh-sorts are not used. The strategy ccfv_insts increases some thresholds. Finally, the strategy best uses a subset of the options used by the other strategies. Sledgehammer uses best for fact filtering.

We have also considered using a scheduler in Isabelle as used in the SMT competition. The advantage is that we do not need to select the strategy on the Isabelle side. However, it would make v-smt unreliable. A problem solved by only one strategy just before the end of its time slice can become unprovable on slower hardware. Issues with z-smt timeouts have been reported on the Isabelle mailing list, e.g., due to an antivirus delaying the startup [27].

#### **5.2 Improvements of Sledgehammer Results**

To measure the performance of the v-smt tactic, we ran Mirabelle on the full HOL-Library, the theory Prime Distribution Elementary (PDE) [22], an executable resolution prover (RP) [37], and the Simplex algorithm [30]. We extended Sledgehammer's proof preplay to try all veriT strategies and added instrumentation for

**Table 2.** Outcome of Sledgehammer calls showing the total success rate (SR, higher is better) of one-liner proof preplay, the number of suggested v-smt (OLv) and z-smt (OLz) one-liners, and the number of preplay failures (PF, lower is better), in percentages of the unique goals.


the time of all tried tactics. Sledgehammer and automatic provers are mostly nondeterministic programs. To reduce the variance between the different Mirabelle runs, we use the deterministic MePo fact filter [33] instead of the better performing MaSh [28] that uses machine learning (and depends on previous runs) and underuse the hardware to minimize contention. We use the default timeouts of 30 seconds for the fact filtering and one second for the proof preplay. This is similar to the Judgment Day experiments [17]. The raw results are available [1].

Success Rate. Users are not interested in which tactics are used to prove a goal, but in how often Sledgehammer succeeds. There are three possible outcomes: (i) a successfully preplayed proof, (ii) a proof hint that failed to be preplayed (usually because of a timeout), or (iii) no proof. We define the success rate as the proportion of outcome (i) over the total number of Sledgehammer calls.

Table 2 gathers the results of running Sledgehammer on all unique goals and analyzing its outcome using different preplay configurations where only z-smt (the baseline) or both v-smt and z-smt are enabled. Any useful preplay tactic should increase the success rate (SR) by preplaying new proof hints provided by the fact-filter prover, reducing the preplay failure rate (PF).

Let us consider, e.g., the results when using CVC4 as fact-filter prover. The success rate of the baseline on the HOL-Library is 54.5% and its preplay failure rate is 1.5%. This means that CVC4 found a proof for 54.5% + 1.5% = 56% of the goals, but that Isabelle's proof methods failed to preplay many of them. In such

cases, Sledgehammer gives a proof hint to the user, who then has to manually find a functioning proof. By enabling v-smt, the failure rate decreases by two thirds, from 1.5% to 0.5%, which directly increases the success rate by one percentage point: new cases where the burden of the proof is moved from the user to the proof assistant. The failure rate is reduced in similar proportions for PNT (63%), RP (63%), and Simplex (56%). For these formalizations, this improvement translates to a smaller increase of the success rate, because the baseline failure rate was smaller to begin with. This confirms that the instantiation technique of conflicting instances [8, 36] is important for CVC4.

When using veriT or Z3 as fact-filter prover, a failure rate of zero could be expected, since the same SMT solvers are used for both fact filtering and preplaying. The observed failure rate can partly be explained by the much smaller timeout for preplay (1 second) than for fact filtering (30 seconds).

Overall, these results show that our proof reconstruction enables Sledgehammer to successfully preplay more proofs. With v-smt enabled, the weighted average failure rate decreases as follows: for CVC4, from 1.3% to 0.4%; for E, from 1.5% to 1.2%; for veriT, from 1.0% to 0.3%; and for Z3, from 0.7% to 0.3%. For the user, this means that the availability of v-smt as a proof preplay tactic increases the number of goals that can be fully automatically proved.

Saved time. Table 3 shows a different view on the same results. Instead of the raw success rate, it shows the time that is spent reconstructing proofs. Using the baseline configuration, preplaying all formalizations takes a total of 250.1 + 33.4 + 37.2 + 42.8 = 363.5 seconds. When enabling v-smt, some calls to z-smt are replaced by faster v-smt calls and the reconstruction time decreases by 13% to 212.6 + 28.4 + 34.4 + 41.6 = 317.0 seconds. Note that the per-formalization improvement varies considerably: 15% for HOL-Library, 15% for PNT, 7.5% for RP, and 4.0% for Simplex.

For the user, this means that enabling v-smt as a proof preplay tactic may significantly reduce the verification time of their formalizations.

Impact of the Strategies. We have also studied what happens if we remove a single veriT strategy from Sledgehammer (Table 4). The most important one is best, as it solves the highest number of problems. By contrast, default is almost entirely covered by the other strategies. ccfv_SIG and del_insts are faster than z-smt on a similar number of goals, but the latter solves more goals uniquely and therefore saves more time. Each strategy solves some problems that cannot be reconstructed using any other. The results are similar for the other theories used in Table 3.

#### **5.3 Speed of Reconstruction**

To better understand which rules are key to our reconstruction, we recorded the time used to reconstruct each rule and the time required by the solver, over all calls attempted by Sledgehammer, including the ones not selected. The reconstruction ratio (reconstruction time over search time) shows how much slower reconstructing

**Table 3.** Preplayed proofs (Pr.) and their execution time (s) when using CVC4 as fact-filter prover. Shared proofs are found with and without v-smt and new proofs are found only with v-smt. The proofs and their associated timings are categorized in one-liners using v-smt (OLv), z-smt (OLz), or any other Isabelle proof methods (OLo).


**Table 4.** Reconstruction time and number of solved goals when removing a single strategy (HOL-Library results only), using CVC4 as fact filter.


compared to finding a proof is. For 25% of the proofs, Z3's concise format is better and the reconstruction is faster than proof finding (first quartile of the ratio: 0.9 for v-smt vs. 0.1 for z-smt). The 99th percentile (18.6 for v-smt vs. 27.2 for z-smt) shows that veriT's detailed proof format reduces the number of slow proofs. On average, reconstruction is slower than proof finding for both solvers.

Fig. 1 shows the distribution of the time spent on some rules. We remove the slowest and fastest 5% of the applications, because garbage collection can trigger at any moment and even trivial rules can be slow. Fig. 2 gives the sum of all reconstruction times over all proofs. We call parsing the time required to parse and convert the veriT proof into Isabelle terms.

Overall, there are two kinds of rules: (1) direct applications of a sequence of theorems, e.g., equiv_pos2 corresponds to the theorem ¬(a ⟷ b) ∨ a ∨ ¬b; and (2) calls to full-blown tactics, like qnt_cnf (Sect. 4.2).

First, direct applications of theorems are usually fast, but they occur so often that the cumulative time is substantial. For example, cong only needs to unfold assumptions and apply reflexivity and symmetry of equality. However, it appears so often, and sometimes on large terms, that it is an important rule.
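The theorem behind equiv_pos2, for instance, can be confirmed by an exhaustive truth table (a quick sanity check, not part of the reconstruction):

```python
# Sanity check: the equiv_pos2 conclusion  not(a <-> b) \/ a \/ not b
# is a propositional tautology, so one prebuilt theorem discharges it.
from itertools import product

def equiv_pos2(a, b):
    return (not (a == b)) or a or (not b)

print(all(equiv_pos2(a, b) for a, b in product([False, True], repeat=2)))
# True
```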

Second, rules which require full-blown tactics are the slowest rules. For qnt_cnf (CNF under quantifiers, see Sect. 4.2), we have not written a specialized tactic, but rely on Isabelle's tableau-based blast tactic. This rule is rather slow, but rarely used. The rule la_generic is similar: it is slow on average, but searching for the coefficients would take even more time.

We can also see that the time required to check the simplification steps that were formerly combined into the connect_equiv rule is no longer significant.

We have performed the same experiments with the reconstruction of the SMT solver Z3. In contrast to veriT, we do not have the amount of time required for parsing. The results are shown in Figs. 3 and 4. The rule distribution is very different. The nnf-neg and nnf-pos rules are the slowest rules and take a huge amount of time in the worst case. However, the coarser quantifier instantiation step is on average faster than the one produced by veriT. We suspect that reconstruction is faster because the rule, which is only an implication without choice terms, is easier to check (no equality reordering).

#### **6 Related Work**

The SMT solvers CVC4 [10], Z3 [34], and veriT [19] produce proofs. CVC4 does not record quantifier reasoning in the proof, and Z3 uses some macro rules. Proofs from SMT solvers have also been used to find unsatisfiability cores [20] and interpolants [32]. They are also useful to debug the solver itself, since unsound steps often point to the origin of bugs. Our work also relates to systems like Dedukti [5] that focus on translating proof steps, not on replaying them.

Proof reconstruction has been implemented in various systems, including CVC4 proofs in HOL Light [31], Z3 in HOL4 and Isabelle/HOL [18], and veriT [4] and CVC4 [24] in Coq. Only veriT produces detailed proofs for preprocessing and skolemization. SMTCoq [4, 24] currently supports version 1 of veriT's proof output, which has different rules, lacks the detailed skolemization rules, and is implemented only in the 2016 version of veriT, which has worse performance. SMTCoq also supports bit vectors and arrays.

The reconstruction of Z3 proofs in HOL4 and Isabelle/HOL is one of the most advanced and best tested; it is regularly used by Isabelle users. The Z3 proof reconstruction succeeds on more than 90% of Sledgehammer benchmarks [14, Section 9] and is efficient (although an older version of Z3 was used). Performance numbers are reported [16,18] not only for problems generated by proof assistants (including Isabelle), but also for preexisting SMT-LIB files from the SMT-LIB library.

The performance study by Böhme [16, Sect. 3.4] uses version 2.15 of Z3, whereas we use version 4.4.0, which currently ships with Isabelle. Since version 2.15, the proof format has changed slightly (e.g., th-lemma-arith was introduced), fulfilling some of the wishes expressed by Böhme and Weber [18] to simplify reconstruction. Surprisingly, the nnf rules do not appear among the five rules that used the most runtime. Instead, the th-lemma and rewrite rules were the slowest. As for veriT, the cong rule was among the most used (without accounting for the most time), but it does not appear in our Z3 tests.

**Fig. 1.** Timing, sorted by the median, of a subset of veriT's rules. From left to right, the lower whisker marks the 5th percentile, the lower box line the first quartile, the middle of the box the median, the upper box line the third quartile, and the upper whisker the 95th percentile.

**Fig. 2.** Total percentage of time spent on each rule for the SMT solver veriT, in the same order as Fig. 1. This graph maps the rules already shown in Fig. 1 to the total amount of time. The slowest rules are th_resolution (14.7%), parsing (10.3%), and cong (9.77%).

**Fig. 3.** Timing of some of Z3's rules, sorted by median. The whiskers and box lines mark the same percentiles as in Fig. 1. nnf-neg's 95th percentile is 87 ms, nnf-pos's is 33 ms, and parsing's is 25 ms.

**Fig. 4.** Total amount of time per rule for the SMT solver Z3. nnf-neg takes 39% of the reconstruction time.

CVC4 follows a different philosophy compared to veriT and Z3: it produces proofs in a logical framework with side conditions [39]. The output can contain programs to check certain rules. The proof format is flexible in some aspects and restrictive in others. Currently CVC4 does not generate proofs for quantifiers.

#### **7 Conclusion**

We presented an efficient reconstruction of proofs generated by a modern SMT solver in an interactive theorem prover. Our improvements address reconstruction challenges for proof steps of typical inferences performed by SMT solvers.

By studying the time required to replay each rule, we were able to compare the reconstruction for two different proof formats with different design directions. The very detailed proof format of veriT makes the reconstruction easier to implement and allows for more specialization of the tactics. On slow proofs, the ratio of time to reconstruct and time to find a proof is better for our more detailed format. Integrating our reconstruction in Isabelle halves the number of failures from Sledgehammer and nicely completes the existing reconstruction method with Z3.

Our work is integrated into Isabelle version 2021. Sledgehammer suggests the veriT-based reconstruction if it is the fastest tactic that finds the proof; so users profit without action required on their side. We plan to improve the reconstruction of the slowest rules and remove inconsistencies in the proof format. The developers of the SMT solver CVC4 are currently rewriting the proof generation and plan to support a similar proof format. We hope to be able to reuse the current reconstruction code by only adding support for CVC4-specific rules. Generating and reconstructing proofs from the veriT version with higher-order logic [9] could also improve the usefulness of veriT on Isabelle problems. The current proof rules [40] should accommodate the more expressive logic.

**Acknowledgments.** We would like to thank Haniel Barbosa for his support with the implementation in veriT. We also thank Haniel Barbosa, Jasmin Blanchette, Pascal Fontaine, Daniela Kaufmann, Petar Vukmirović, and the anonymous reviewers for many fruitful discussions and for suggesting many textual improvements. The first and third authors have received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreements No. 713999, Matryoshka, and No. 830927, Concordia). The second author is supported by the LIT AI Lab funded by the State of Upper Austria. The training presented in this paper was carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER, and several universities as well as other organizations (see https://www.grid5000.fr).

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### An Automated Approach to the Collatz Conjecture<sup>⋆</sup>

Emre Yolcu<sup>1</sup>, Scott Aaronson<sup>2</sup>, and Marijn J. H. Heule<sup>1,3</sup>

<sup>1</sup> Carnegie Mellon University, Pittsburgh, PA 15213, USA {emreyolcu,marijn}@cmu.edu
<sup>2</sup> University of Texas at Austin, Austin, TX 78712, USA scott@scottaaronson.com
<sup>3</sup> Amazon Scholar

Abstract. We explore the Collatz conjecture and its variants through the lens of termination of string rewriting. We construct a rewriting system that simulates the iterated application of the Collatz function on strings corresponding to mixed binary–ternary representations of positive integers. Termination of this rewriting system is equivalent to the Collatz conjecture. To show the feasibility of our approach in proving mathematically interesting statements, we implement a minimal termination prover that uses the automated method of matrix/arctic interpretations and we perform experiments where we obtain proofs of nontrivial weakenings of the Collatz conjecture. Finally, we adapt our rewriting system to show that other open problems in mathematics can also be approached as termination problems for relatively small rewriting systems. Although we do not succeed in proving the Collatz conjecture, we believe that the ideas here represent an interesting new approach.

# 1 Introduction

Let $\mathbb{N} = \{0, 1, 2, \dots\}$ denote the natural numbers and $\mathbb{N}^{+} = \{1, 2, 3, \dots\}$ denote the positive integers. We define the *Collatz function* $C \colon \mathbb{N}^{+} \to \mathbb{N}^{+}$ as

$$C(n) = \begin{cases} n/2 & \text{if } n \equiv 0 \pmod{2} \\ 3n+1 & \text{if } n \equiv 1 \pmod{2} . \end{cases}$$

Given a function $f$ and a number $k \in \mathbb{N}$, the function $f^k$ denotes the $k$th iterate of $f$. The well-known *Collatz conjecture* is the following:

*Conjecture 1.* For all $n \in \mathbb{N}^{+}$, there exists some $k \in \mathbb{N}$ such that $C^k(n) = 1$.

This is a longstanding open problem and there is a vast literature dedicated to its study. For its history, we refer the reader to the comprehensive surveys by Lagarias [17–19].
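For concreteness, the iteration in Conjecture 1 is easy to carry out mechanically. The following minimal Python sketch (an illustration only, not part of the formal development) computes $C$-trajectories:

```python
def collatz(n: int) -> int:
    """The Collatz function C on the positive integers."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def trajectory(n: int, limit: int = 1000) -> list[int]:
    """Iterate C starting from n until 1 is reached (or a step limit is hit)."""
    steps = [n]
    while n != 1 and len(steps) <= limit:
        n = collatz(n)
        steps.append(n)
    return steps

# The trajectory of 6 reaches 1 after 8 applications of C:
print(trajectory(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

Conjecture 1 asserts that `trajectory(n)` ends in 1 for every positive `n`; this has been verified computationally for very large initial segments of the integers but remains unproven in general.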

Definition 1 (Convergent function). *Consider a function $f \colon X \to X$. Given $x \in X$, the sequence of iterates $f^{\tau}(x) := (x, f(x), f^2(x), \dots)$ is called the* f-trajectory of x*. For some designated element $z \in X$, if for all $x \in X$ the trajectory $f^{\tau}(x)$ contains $z$, the function $f$ is called* convergent*.*

<sup>⋆</sup> The full version is available at https://www.cs.cmu.edu/~eyolcu/research/rewriting-collatz.pdf. © The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 468-484, 2021. https://doi.org/10.1007/978-3-030-79876-5\_27

In this paper, we describe an approach based on termination of string rewriting to automatically search for a proof of the Collatz conjecture. Although trying to prove the Collatz conjecture via automated deduction is clearly a moonshot goal, there are two recent technological advances that provide reasons for optimism that at least some interesting variants of the problem might be solvable. First, the invention of the method of matrix interpretations and its variants such as arctic interpretations turns the quest of finding a ranking function to witness termination into a problem that is suitable for systematic search. Second, the progress in satisfiability (SAT) solving makes it possible to solve many seemingly difficult combinatorial problems efficiently in practice. Their combination, i.e., using SAT solvers to find interpretations, has so far been effective in solving challenging termination problems. We make the following contributions:


# 2 Preliminaries

#### 2.1 String Rewriting Systems

Definition 2 (String rewriting system). *Let $\Sigma$ be an alphabet, i.e., a set of symbols. A* string rewriting system *(SRS) over $\Sigma$ is a relation $R \subseteq \Sigma^{*} \times \Sigma^{*}$. Elements $(\ell, r) \in R$ are called* rewrite rules *and are usually written as $\ell \to r$. The system $R$ induces a* rewrite relation *$\to_R := \{(s \ell t, s r t) \mid s, t \in \Sigma^{*},\ \ell \to r \in R\}$ on the set $\Sigma^{*}$ of strings.*

Definition 3 (Termination). *A relation $\to$ on $A$ is* terminating *(denoted $\mathrm{SN}(\to)$) if there is no infinite sequence $s_0, s_1, \dots \in A$ such that $s_i \to s_{i+1}$ for all $i \geq 0$.*

We conflate an SRS R with the rewrite relation it induces, writing "R is terminating" instead of "→<sup>R</sup> is terminating". The following is a useful generalization of termination:

Definition 4 (Relative termination). *For SRSs* R *and* S*, the system* R *is said to be* terminating relative to S *(denoted* SN(R/S)*) if every sequence of rewrites for the system* <sup>R</sup> <sup>∪</sup> <sup>S</sup> *applies the rules from* <sup>R</sup> *at most finitely many times.*

Relative termination allows proofs to be broken into steps as codified by the following.

Lemma 1 (Rule removal [29, Theorem 1]). *Let $R$ be an SRS. If there exists a subset $T \subseteq R$ such that $\mathrm{SN}(T/R)$ and $\mathrm{SN}(R \setminus T)$, then $\mathrm{SN}(R)$.*

This lemma allows us to "remove rules" in the following way. When proving SN(R), if we succeed at finding a subset T satisfying SN(T /R), the proof obligation becomes weakened to SN(R\T), where the rules of <sup>T</sup> are no longer present. This removal of rules can be repeated until no rules remain, thus producing a stepwise proof of termination.

Another useful technique is reversal:

Lemma 2 (Rule reversal [29, Lemma 2]). *For a string $s = s_1 \dots s_n \in \Sigma^{*}$, denote $s^{\mathrm{rev}} := s_n \dots s_1$ and define the* reversal *of an SRS $R$ as $R^{\mathrm{rev}} := \{\ell^{\mathrm{rev}} \to r^{\mathrm{rev}} \mid \ell \to r \in R\}$. For SRSs $R$ and $S$, we have $\mathrm{SN}(R/S)$ if and only if $\mathrm{SN}(R^{\mathrm{rev}}/S^{\mathrm{rev}})$.*

Reversal is of interest because methods for proving termination are not necessarily invariant under reversal; that is, a given technique may fail to show termination of a system $R$ while succeeding for its reversal $R^{\mathrm{rev}}$.

Yet another important notion is top termination:

Definition 5 (Top termination). *Let $R$ be an SRS over $\Sigma$. The* top rewrite relation *induced by $R$ is defined as $\to_{R^{\mathrm{top}}} := \{(\ell s, r s) \mid s \in \Sigma^{*},\ \ell \to r \in R\}$. If $\to_{R^{\mathrm{top}}}$ is terminating, $R$ is said to be* top terminating*.*

In plain language, top termination allows rewrites to be performed only at the leftmost end of a string. As we will see in the next section (Theorem 1), top termination problems can admit proofs of a more relaxed form compared to termination. Relative top termination, i.e., proving $\mathrm{SN}(R^{\mathrm{top}}/S)$ for SRSs $R$ and $S$, is a crucial component in the dependency pair approach [1], which reduces a termination problem to a relative top termination problem that is often easier to solve. In order to avoid requiring familiarity with the dependency pair approach, we omit its discussion, and instead prove a self-contained result (Lemma 4) that encapsulates dependency pairs in a more elementary manner for the specific rewriting systems that we consider in this paper.

#### 2.2 Interpretation Method

We state (at a high level) the key results on matrix/arctic interpretations that we use in our implementation. For more details we refer the reader to existing work [2,6,10,15,26]. With the interpretation method, the main idea is to find a ranking function that assigns a value to each string such that it decreases strictly when the string is modified by an application of a rewrite rule. If for all strings the value is bounded from below, then it cannot decrease indefinitely, ruling out the existence of an infinite sequence of rewrites. Formally, we search for an instance of the following:

Definition 6 (Extended/weakly monotone algebra). *Let $\Sigma$ be an alphabet, $A$ a set, $[\sigma] \colon A \to A$ an interpretation for every $\sigma \in \Sigma$, and $>$ and $\gtrsim$ order relations over $A$ such that $>$ is well-founded and satisfies $>\,\cdot\,\gtrsim\ \subseteq\ >$. Letting $[\cdot]_{\Sigma} := \{[\sigma] \mid \sigma \in \Sigma\}$, the structure $(A, [\cdot]_{\Sigma}, >, \gtrsim)$ is a* weakly monotone $\Sigma$-algebra *if for every $\sigma \in \Sigma$ the interpretation $[\sigma]$ is monotone with respect to $\gtrsim$. It is an* extended monotone $\Sigma$-algebra *if, additionally, for every $\sigma \in \Sigma$ the interpretation $[\sigma]$ is monotone with respect to $>$.*

We extend the interpretation from symbols to strings $s = s_1 \dots s_n \in \Sigma^{*}$ as $[s] := [s_1] \circ \dots \circ [s_n]$. The following general theorem characterizes relative termination (resp. top termination) as the existence of extended (resp. weakly) monotone algebras.

Theorem 1 ([6, Theorem 2]). *Let $R$ and $S$ be SRSs over the alphabet $\Sigma$. We have $\mathrm{SN}(R/S)$ (resp. $\mathrm{SN}(R^{\mathrm{top}}/S)$) if and only if there exists an extended (resp. weakly) monotone $\Sigma$-algebra $(A, [\cdot]_{\Sigma}, >, \gtrsim)$ such that*

– *for each rule $\ell \to r \in R$ we have $[\ell](x) > [r](x)$ for all $x \in A$,*

– *for each rule $\ell \to r \in S$ we have $[\ell](x) \gtrsim [r](x)$ for all $x \in A$.*

An effective way to prove relative (top) termination is to satisfy the conditions of the above theorem by fixing $(A, >, \gtrsim)$ and algorithmically searching for appropriate interpretations of symbols. The method of matrix interpretations is an instance of this approach. We fix a dimension $d$, set $A = \mathbb{N}^{d}$, define $x \gtrsim y \iff x_i \geq y_i$ for all $i \in \{1, \dots, d\}$, and define $x > y \iff x \gtrsim y \wedge x_1 > y_1$. For interpreting each symbol $\sigma \in \Sigma$, we consider an affine function $[\sigma](x) = M_{\sigma} x + v_{\sigma}$. In this way, the structure $(\mathbb{N}^{d}, [\cdot]_{\Sigma}, >, \gtrsim)$ satisfies the requirements of Definition 6 for a weakly monotone algebra. Additionally requiring $(M_{\sigma})_{1,1} \geq 1$ satisfies the requirements for an extended monotone algebra. Matrix interpretations can also be adapted to the max–plus algebra of arctic numbers $A := \mathbb{N} \cup \{-\infty\}$ as coefficients, with different arithmetic operations and order relations [15,26].

*Example 1.* Let $R = \{aa \to aba\}$ and $S = \{b \to bb\}$. The following functions constitute a matrix interpretation proof of $\mathrm{SN}(R/S)$.

$$[a](\vec{x}) = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \qquad [b](\vec{x}) = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

It can be checked that the above interpretations give an extended monotone algebra and that they satisfy the following for all $x \in \mathbb{N}^{2}$, which implies $\mathrm{SN}(R/S)$ via Theorem 1.

$$\begin{aligned} [aa](\vec{x}) &= \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 1 \\ 1 \end{bmatrix} > \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} = [aba](\vec{x}) \\ [b](\vec{x}) &= \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 0 \\ 0 \end{bmatrix} \gtrsim \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 0 \\ 0 \end{bmatrix} = [bb](\vec{x}) \end{aligned}$$
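The comparisons in Example 1 can be checked mechanically. The sketch below (our own illustration; the helper names are not from the paper) composes the affine interpretations and verifies the decreases coefficient-wise, which suffices because for affine maps over the naturals a componentwise inequality on matrices and vectors implies the inequality for all arguments:

```python
from functools import reduce

def compose(f, g):
    """Affine composition: (f o g)(x) = M_f (M_g x + v_g) + v_f."""
    (Mf, vf), (Mg, vg) = f, g
    n = len(vf)
    M = [[sum(Mf[i][k] * Mg[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    v = [sum(Mf[i][k] * vg[k] for k in range(n)) + vf[i] for i in range(n)]
    return M, v

def interp(word, maps):
    """[s] := [s1] o ... o [sn]."""
    return reduce(compose, (maps[c] for c in word))

def weak_geq(f, g):
    """Sufficient coefficient-wise check that [l](x) >= [r](x) componentwise."""
    (Mf, vf), (Mg, vg) = f, g
    return (all(a >= b for ra, rb in zip(Mf, Mg) for a, b in zip(ra, rb))
            and all(a >= b for a, b in zip(vf, vg)))

def strict_gt(f, g):
    """Additionally strict in the first coordinate, so [l](x) > [r](x)."""
    return weak_geq(f, g) and f[1][0] > g[1][0]

maps = {"a": ([[1, 1], [0, 0]], [0, 1]),
        "b": ([[1, 0], [0, 0]], [0, 0])}

assert interp("aa", maps) == ([[1, 1], [0, 0]], [1, 1])   # matches [aa] above
assert strict_gt(interp("aa", maps), interp("aba", maps))  # rule of R
assert weak_geq(interp("b", maps), interp("bb", maps))     # rule of S
```

Here `strict_gt` implements the order $x > y \iff x \gtrsim y \wedge x_1 > y_1$ lifted to affine maps.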

In order to automate the search for the interpretations given a rewriting system R, an effective approach is to encode all of the aforementioned constraints as a propositional formula in CNF and use a SAT solver to look for a satisfying assignment. This additionally involves fixing a finite domain for the coefficients that can occur in the interpretations and encoding arithmetic over the chosen finite domain using propositional variables.
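As a toy stand-in for that search (exhaustive enumeration over coefficients in {0, 1} replaces the SAT solver; everything here is our illustration, not the encoding used by termination tools), the following sketch finds a matrix interpretation proving SN(R/S) for the systems of Example 1:

```python
from itertools import product

def compose(f, g):
    """Affine composition over N^2: (f o g)(x) = M_f (M_g x + v_g) + v_f."""
    (Mf, vf), (Mg, vg) = f, g
    M = [[sum(Mf[i][k] * Mg[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    v = [sum(Mf[i][k] * vg[k] for k in range(2)) + vf[i] for i in range(2)]
    return M, v

def interp(word, maps):
    f = maps[word[0]]
    for c in word[1:]:
        f = compose(f, maps[c])
    return f

def weak(l, r):
    """Coefficient-wise sufficient check for a weak decrease on all of N^2."""
    (Ml, vl), (Mr, vr) = l, r
    return (all(Ml[i][j] >= Mr[i][j] for i in range(2) for j in range(2))
            and all(a >= b for a, b in zip(vl, vr)))

def strict(l, r):
    """Additionally strict in the first coordinate."""
    return weak(l, r) and l[1][0] > r[1][0]

def candidates():
    """All affine maps with coefficients in {0, 1} and M[0][0] = 1,
    which keeps every interpretation monotone with respect to >."""
    for a, b, c, d, e in product((0, 1), repeat=5):
        yield ([[1, a], [b, c]], [d, e])

R = [("aa", "aba")]  # rules that must decrease strictly
S = [("b", "bb")]    # rules that must decrease weakly

found = None
for fa, fb in product(candidates(), repeat=2):
    maps = {"a": fa, "b": fb}
    if (all(strict(interp(l, maps), interp(r, maps)) for l, r in R)
            and all(weak(interp(l, maps), interp(r, maps)) for l, r in S)):
        found = maps
        break

assert found is not None  # a valid interpretation exists (cf. Example 1)
```

A real implementation encodes the same coefficient constraints propositionally and delegates the search to a SAT solver, which scales to much larger coefficient domains and alphabets.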

#### 2.3 Generalized Collatz Functions

We consider instances of the following generalization of the Collatz function. Its variants have commonly appeared in the literature [3, 12, 14, 16, 21, 24, 27].

Definition 7 (Generalized Collatz function). *Let $X$ be one of $\mathbb{N}$, $\mathbb{N}^{+}$, or $\mathbb{Z}$ and define $X_{\bot} := X \cup \{\bot\}$. A function $f \colon X_{\bot} \to X_{\bot}$ is a* generalized Collatz function *if $f(\bot) = \bot$ and there exist an integer $d \geq 2$ and rational numbers $q_0, \dots, q_{d-1}, r_0, \dots, r_{d-1}$ such that for all $0 \leq i \leq d-1$ and all $n \in X$, we have*

$$f(n) = q_i n + r_i \quad\text{or}\quad f(n) = \bot \qquad \text{if } n \equiv i \pmod{d}.$$

In the above, we allow the representation of a partially defined function by mapping to <sup>⊥</sup> in the undefined cases. We call a partial <sup>f</sup> convergent if all <sup>f</sup>-trajectories contain <sup>⊥</sup>.

Note that the Collatz function corresponds to a generalized Collatz function with $d = 2$, $q_0 = 1/2$, $r_0 = 0$, $q_1 = 3$, $r_1 = 1$. Although the Collatz function is by far the most widely studied case, there are several other concrete examples of generalized Collatz functions the convergence of which is worth studying due to their connections to open problems in number theory and computability theory. We discuss these cases in Section 5.
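A generalized Collatz function is determined by its modulus d and its case table. A small Python sketch (the helper names are ours, for illustration) makes this concrete and recovers the Collatz function as the stated instance:

```python
from fractions import Fraction

def make_gcf(d, cases):
    """Build a generalized Collatz function from a case table mapping each
    residue i to (q_i, r_i), or to None for the undefined case (bottom)."""
    def f(n):
        if n is None:
            return None          # f(bottom) = bottom
        case = cases[n % d]
        if case is None:
            return None
        q, r = case
        return int(q * n + r)    # q_i * n + r_i, an integer on the domain
    return f

# The Collatz function C: d = 2 with q0 = 1/2, r0 = 0, q1 = 3, r1 = 1.
C = make_gcf(2, {0: (Fraction(1, 2), 0), 1: (3, 1)})
assert [C(6), C(3), C(10)] == [3, 10, 5]
```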

# 3 Rewriting the Collatz Function

We start with systems that use unary representations and then demonstrate via examples that mixed base representations can be more suitable for use with automated methods.

#### 3.1 Rewriting in Unary

The following system of Zantema [29] simulates iterated application of the Collatz function to a number represented in unary, and terminates upon reaching 1.

*Example 2.* Z denotes the following SRS, consisting of 5 symbols and 7 rules.


This system can be seen as encoding the execution of a Turing machine with cells that can be contracted/expanded. The symbol 1 and the blank symbol form the tape alphabet, while the symbols h (half), s (shift), and t (triple) indicate the head along with the state of the machine. Through the following result, the Collatz conjecture can be reformulated as termination of string rewriting.

Theorem 2 ([29]). Z *is terminating if and only if the Collatz conjecture holds.*

While the forward direction of the above theorem is easy to see (since $h1^{2n} \to_{Z}^{*} h1^{n}$ for $n > 1$ and $h1^{2n+1} \to_{Z}^{*} h1^{3n+2}$ for $n \geq 0$), the backward direction is far from obvious, because not every string corresponds to a valid configuration of the underlying machine.

As another example, consider the system $W = \{h11 \to 1h,\ 1h \to 1t,\ 1t \to t111,\ t \to h\}$ (originally due to Zantema<sup>4</sup>). Termination of this system has yet to be proved via automated methods. Nevertheless, there is a simple reason for its termination:

<sup>4</sup> https://www.lri.fr/~marche/tpdb/tpdb-2.0/SRS/Zantema/z079.srs

It simulates iterated application of a partial generalized Collatz function $W \colon \mathbb{N}^{+}_{\bot} \to \mathbb{N}^{+}_{\bot}$ defined as follows, which is easily seen to be convergent.

$$W(n) = \begin{cases} 3n/2 & \text{if } n \equiv 0 \pmod{2} \\ \bot & \text{if } n \equiv 1 \pmod{2} \end{cases}$$
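The convergence of W can also be checked empirically: writing n = 2^a · m with m odd, each step replaces n by 2^(a-1) · 3m, so after a steps an odd number is reached and the trajectory hits bottom. A small sketch (with `None` standing for the undefined value):

```python
def W(n):
    """The partial generalized Collatz function W; None stands for bottom."""
    if n is None or n % 2 == 1:
        return None
    return 3 * n // 2

def hits_bottom(n):
    """Every W-trajectory reaches bottom: each step strips one factor of 2,
    so the loop runs at most as many steps as the 2-adic valuation of n."""
    while n is not None:
        n = W(n)
    return True

# W is convergent on an initial segment of the positive integers:
assert all(hits_bottom(n) for n in range(1, 10_000))
```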

If a proof of the Collatz conjecture is to be produced by some automated method that relies on rewriting, then that method had better be able to prove a statement as simple as the convergence of W. With this in mind, we describe an alternative rewriting system that simulates the Collatz function and terminates upon reaching 1. We then provide examples where the alternative system is more suitable for use with termination tools (for instance, allowing an automated proof of the convergence of W).

#### 3.2 Rewriting in Mixed Base

In the mixed base scheme, the overall idea is as follows. Given a number <sup>n</sup> <sup>∈</sup> <sup>N</sup><sup>+</sup>, we write a mixed binary–ternary representation for it (noting that this representation is not unique). With this representation, as long as the least significant digit is binary, the parity of the number can be recognized by checking only this digit, as opposed to scanning the entire string when working in unary. This allows us to easily determine the correct case when applying the Collatz function. If the least significant digit is ternary, then the representation is rewritten (while preserving its decimal value) to make this digit binary. Afterwards, since computing n/2 corresponds to erasing a trailing binary 0 and computing 3n + 1 corresponds to inserting a trailing ternary 1, applying the Collatz function takes a single rewrite step. We explain this scheme more formally below.

A mixed base numeral system is a numeral system where the base changes across positions, which we define as follows. Note that unary is not a positional numeral system, so we require the bases to be greater than 1.

Definition 8 (Mixed base representation). *Let $B \subseteq \mathbb{N}_{>1}$ be a set of bases and let $N = (n_1)_{b_1} (n_2)_{b_2} \dots (n_k)_{b_k}$ be a string where $n_i \in \mathbb{N}$. If we have for each $1 \leq i \leq k$ that $b_i \in B$ and $0 \leq n_i < b_i$, then $N$ is called a* mixed $B$-ary representation*.*

The string $N$ from above represents the decimal number $N_{10} = \sum_{i=1}^{k} n_i \prod_{j=i+1}^{k} b_j$. Observing that the addition of leading zeros to a string does not change its decimal value, we may assume without loss of generality that $n_1 > 0$. Furthermore, $b_1$ does not affect the decimal value of the string, so we may omit it.

Now, define $\beta_b^n(x) := bx + n$. After rearranging, we see that the decimal value of the $B$-ary string $N = n_1 (n_2)_{b_2} \dots (n_k)_{b_k}$ may also be written as $N_{10} = (\beta_{b_k}^{n_k} \circ \beta_{b_{k-1}}^{n_{k-1}} \circ \dots \circ \beta_{b_2}^{n_2})(n_1)$. This gives us a string view and a function view of the same representation, and we will switch between them as appropriate. In doing so, we also conflate the symbols and the corresponding functions, referring to $\beta_b^n$ as $n_b$.

As the last ingredient before describing the rewriting system, we observe that we can write $(\beta_b^n \circ \beta_c^m)(x) = bcx + bm + n$ equivalently as another composition $(\beta_c^{m'} \circ \beta_b^{n'})(x) = cbx + cn' + m'$ for suitable $0 \leq n' < b$ and $0 \leq m' < c$. This allows us to swap the bases of adjacent positions while preserving the decimal value of the string.
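These definitions are directly executable. The sketch below (helper names are ours) computes the decimal value of a mixed base string, given as (digit, base) pairs with the most significant digit first, and performs the value-preserving base swap via `divmod`:

```python
def beta(b, n):
    """beta_b^n(x) = b*x + n."""
    return lambda x: b * x + n

def value(digits):
    """Decimal value of a mixed base string given as (digit, base) pairs,
    most significant first; the base of the leading digit is irrelevant."""
    x = digits[0][0]
    for n, b in digits[1:]:
        x = beta(b, n)(x)
    return x

def swap(d1, d2):
    """Swap the bases of two adjacent digits, preserving the decimal value:
    solve b1*n2' + n1' = b2*n1 + n2 with 0 <= n1' < b1 and 0 <= n2' < b2."""
    (n1, b1), (n2, b2) = d1, d2
    n2_new, n1_new = divmod(b2 * n1 + n2, b1)
    return (n2_new, b2), (n1_new, b1)

# 19 with leading digit 1, then the digits 0_3, 0_2, 1_3 (cf. Example 3):
s = [(1, None), (0, 3), (0, 2), (1, 3)]
assert value(s) == 19

# Swapping the last two digits yields 0_3, 1_2 -- the rewrite rule f1 -> 0t:
t = s[:2] + list(swap(s[2], s[3]))
assert value(t) == 19 and t[2:] == [(0, 3), (1, 2)]
```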

From this point on, we constrain ourselves to the mixed {2, <sup>3</sup>}-ary (binary–ternary) representations as we shift our focus to simulating the Collatz function (noting that it is possible to adapt the rewriting system that we will end up with to other instances of the general case). More precisely, we simulate the following redefinition of the Collatz function where the odd case incorporates an additional division by 2.

$$T(n) = \begin{cases} \frac{n}{2} & \text{if } n \equiv 0 \pmod{2} \\\frac{3n+1}{2} & \text{if } n \equiv 1 \pmod{2} \end{cases}$$

We will describe an SRS $\mathcal{T}$ over the symbols $\{f, t, 0, 1, 2, \triangleleft, \triangleright\}$ that simulates iterated application of the Collatz function and terminates upon reaching 1. The symbols f, t correspond to the binary digits $0_2, 1_2$; and 0, 1, 2 to the ternary digits $0_3, 1_3, 2_3$. The symbol $\triangleleft$ marks the beginning of a string while also standing for the most significant digit (without loss of generality assumed to be 1), and $\triangleright$ marks the end of a string. Consider the functional view of these symbols:

$$\begin{array}{lll} f(x) = 2x & 0(x) = 3x & \triangleleft(x) = 1 \\ t(x) = 2x + 1 & 1(x) = 3x + 1 & \triangleright(x) = x \\ & 2(x) = 3x + 2 & \end{array} \tag{1}$$

Each positive natural number can be expressed as some composition of these functions, which corresponds to a string as per our previous discussion.

*Example 3.* Allowing the inclusion of a redundant trailing symbol in mixed base representations, we can write $19 = (0f1)_{10} = \triangleright(1(f(0(\triangleleft(x)))))$. The string representation ends with a ternary symbol, so we will rewrite it. With the function view, we have $1(f(x)) = 3(2x) + 1 = 6x + 1 = 2(3x) + 1 = t(0(x))$. This shows that we could also write $19 = (00t)_{10}$, which now ends with the binary digit $1_2$. This gives us the rewrite rule $f1 \to 0t$. We can now apply the Collatz function to this representation by rewriting only the rightmost two symbols of the string, since $T(\triangleright(t(x))) = \frac{3(2x+1)+1}{2} = \frac{6x+4}{2} = 3x + 2 = \triangleright(2(x))$. This gives us the rewrite rule $t\triangleright \to 2\triangleright$. After applying this rule, we indeed obtain $T(19) = 29 = (002)_{10}$.

In the manner of the above example, we compute all the necessary transformations and obtain the following 11-rule SRS T .

$$\mathcal{D}_T = \left\{ \begin{array}{l} f\triangleright \to \triangleright \\ t\triangleright \to 2\triangleright \end{array} \right\} \qquad \mathcal{A} = \left\{ \begin{array}{ll} f0 \to 0f & t0 \to 1t \\ f1 \to 0t & t1 \to 2f \\ f2 \to 1f & t2 \to 2t \end{array} \right\} \qquad \mathcal{B} = \left\{ \begin{array}{l} \triangleleft 0 \to \triangleleft t \\ \triangleleft 1 \to \triangleleft ff \\ \triangleleft 2 \to \triangleleft ft \end{array} \right\}$$

This SRS is split into the subsystems $\mathcal{D}_T$ (dynamic rules for $T$) and $\mathcal{X} = \mathcal{A} \cup \mathcal{B}$ (auxiliary rules). The two rules in $\mathcal{D}_T$ encode the application of the Collatz function $T$, while the rules in $\mathcal{X}$ serve to push binary symbols towards the rightmost end of the string by swapping the bases of adjacent positions without changing the represented value.

*Example 4 (Rewrite sequence of* <sup>T</sup> *).* Consider the string <sup>s</sup> <sup>=</sup> ff0 that represents the number <sup>12</sup>. Below is a possible rewrite sequence of <sup>T</sup> that starts from <sup>s</sup>, with the corresponding decimal values (under the interpretations from (1)) displayed above the strings. Underlines indicate the parts of the strings where the rules are applied.

$$\begin{array}{*{15}{c}}
12 && 12 && 6 && 6 && 3 && 3 && 5 \\
\mathtt{ff0} & \to_{\mathcal{A}} & \mathtt{f0f} & \to_{\mathcal{D}_T} & \mathtt{f0} & \to_{\mathcal{A}} & \mathtt{0f} & \to_{\mathcal{D}_T} & \mathtt{0} & \to_{\mathcal{B}} & \mathtt{t} & \to_{\mathcal{D}_T} & \mathtt{2} \\[1ex]
5 && 5 && 8 && 8 && 8 && 4 && 2 && 1 \\
\mathtt{2} & \to_{\mathcal{B}} & \mathtt{ft} & \to_{\mathcal{D}_T} & \mathtt{f2} & \to_{\mathcal{A}} & \mathtt{1f} & \to_{\mathcal{B}} & \mathtt{fff} & \to_{\mathcal{D}_T} & \mathtt{ff} & \to_{\mathcal{D}_T} & \mathtt{f} & \to_{\mathcal{D}_T} & \varepsilon
\end{array}$$

The trajectory of T continues after reaching 1; however, in order to formulate the Collatz conjecture as a termination problem, T is designed so that its rewrite sequences stop upon reaching the string representation of 1, at which point no rule is applicable.
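In the manner of Example 4, the whole simulation can be sketched in Python. The rule tables below follow the SRS as described in the text; the particular rewriting strategy and all helper names are ours, and the end markers are left implicit:

```python
# Binary digits f, t; ternary digits 0, 1, 2; strings evaluated left to
# right from the seed 1 (consistent with ff0 representing 12).
DIGITS = {"f": lambda x: 2*x, "t": lambda x: 2*x + 1,
          "0": lambda x: 3*x, "1": lambda x: 3*x + 1, "2": lambda x: 3*x + 2}
BIN, TER = "ft", "012"
A = {"f0": "0f", "f1": "0t", "f2": "1f", "t0": "1t", "t1": "2f", "t2": "2t"}
B = {"0": "t", "1": "ff", "2": "ft"}

def value(s):
    x = 1
    for c in s:
        x = DIGITS[c](x)
    return x

def step(s):
    """Apply one rule of the SRS (markers implicit)."""
    if s and s[-1] in BIN:               # dynamic rule of D_T at the right end
        return s[:-1] if s[-1] == "f" else s[:-1] + "2"
    for i in range(len(s) - 2, -1, -1):  # auxiliary rules A: push binary right
        if s[i] in BIN and s[i + 1] in TER:
            return s[:i] + A[s[i] + s[i + 1]] + s[i + 2:]
    if s and s[0] in TER:                # auxiliary rules B at the left end
        return B[s[0]] + s[1:]
    return s                             # empty string: represents 1, halt

def trajectory(s):
    """Values attained after each dynamic step."""
    out = []
    while True:
        dynamic = bool(s) and s[-1] in BIN
        t = step(s)
        if t == s:
            return out
        if dynamic:
            out.append(value(t))
        s = t

assert value("ff0") == 12
assert trajectory("ff0") == [6, 3, 5, 8, 4, 2, 1]  # T-trajectory of 12
```

The strategy may pick different rule applications than the sequence displayed in Example 4, but the values attained at its dynamic steps trace out the same T-trajectory.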

Termination of the subsystems of T with B or D_T removed is easily seen. However, since we have matrix interpretations at our disposal, let us give a compact proof.

Lemma 3. SN(T \ B) *and* SN(T \ D_T)*.*

*Proof.* It is easily checked that the interpretations below show SN((T \ B)^rev), which implies SN(T \ B) by Lemma 2.

$$[\mathtt{f}](x) = [\mathtt{t}](x) = 2x + 1 \qquad [\lhd](x) = x \qquad [\mathtt{0}](x) = [\mathtt{1}](x) = [\mathtt{2}](x) = 2x$$

The interpretations below show SN((T \ D_T)^rev), which implies SN(T \ D_T) by Lemma 2.

$$[\mathtt{f}](x) = [\mathtt{t}](x) = [\rhd](x) = x + 1 \qquad [\mathtt{0}](x) = [\mathtt{1}](x) = [\mathtt{2}](x) = 4x \qquad \square$$

As a whole, the system <sup>T</sup> simulates the iterated application of <sup>T</sup> (except at <sup>1</sup>).

Theorem 3. *T is terminating if and only if T is convergent.*

*Proof (sketch).* We observe that the rules of T do not change the number of occurrences of ▷ or ◁ in a string and that the rewrite sequences operate strictly on one side of these symbols. Thus, we may view a given string as split into blocks delimited by ▷ or ◁ and consider the termination of each block separately. In this way, we conclude that there exists a nonterminating rewrite sequence for a string if and only if it contains a block of the *canonical form* ▷(f|t|0|1|2)∗◁ that can be rewritten indefinitely, since the rewrite sequences that start on blocks of all other forms are already seen to terminate by Lemma 3. Furthermore, under the interpretations in (1), the sequences of values attained by the rewrites of the blocks in canonical form correspond directly to Collatz trajectories, since the rules in X do not change the value of the block and the rules in D_T change the value of the block in exactly the same way as the Collatz function T.

When trying to remove a rule in D_T or B, it suffices to show relative top termination, allowing us to use weakly (instead of extended) monotone algebras when applying Theorem 1 and to take advantage of the more relaxed constraints when searching for matrix/arctic interpretations. The lemma below encapsulates dependency pairs, and it can in fact be automatically proved via the dependency pair framework [9].

Lemma 4. *For each subset* R ⊆ B*, if* SN(R_top/T) *then* SN(R/T)*. And, for each subset* R ⊆ D_T*, if* SN((R^rev)_top/T^rev) *then* SN(R^rev/T^rev)*.*

*Proof (sketch).* Without loss of generality, assume we start with a string of the canonical form ▷(f|t|0|1|2)∗◁ (resp. its reversal). Then the rules in B (resp. D_T^rev) can only be applied at the top level. As we know from Lemma 3 that T \ B (resp. T \ D_T) is terminating, any infinite rewrite sequence in T (resp. its reversal) would require infinitely many applications of the rules from B (resp. D_T^rev). As these rules can only be applied at the top level, this would imply relative top nontermination.

## 4 Automated Proofs

We adapt the rewriting system T to different generalized Collatz functions to explore the effectiveness of the mixed base scheme on weakened variants of the Collatz conjecture. The rewriting systems, scripts to reproduce the experiments, and our implementation of a termination prover are available at https://github.com/emreyolcu/rewriting-collatz.

Most top-tier termination tools, such as AProVE, Matchbox, and TTT2, use the SAT solver MiniSat [5] to search for matrix/arctic interpretations. This choice is somewhat surprising, since MiniSat has not been updated since 2008 and the performance of SAT solvers has improved significantly in the last decade. The use of MiniSat in these provers is motivated by its observed effectiveness in finding interpretations. We investigated the reason for this, which turned out to be due to a heuristic that MiniSat disables in its default configuration: MiniSat uses negative branching [5], which explores the "false" branch first for all decision variables, whereas modern SAT solvers use phase saving [22], which first explores the branch corresponding to the truth value to which the variable was most recently forced during unit propagation. In our case, enabling negative branching improves solver performance on formulas that encode the existence of interpretations.
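The difference between the two branching heuristics can be illustrated on a toy DPLL procedure. This is our own minimal sketch, not MiniSat's implementation, and the `saved` dictionary is only a crude stand-in for a real phase-saving mechanism:

```python
def unit_propagate(clauses, assign):
    """Extend `assign` by unit propagation; return None on conflict."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assign.get(abs(l)) == (l > 0) for l in clause):
                continue                      # clause already satisfied
            free = [l for l in clause if abs(l) not in assign]
            if not free:
                return None                   # conflict: clause falsified
            if len(free) == 1:                # unit clause: forced assignment
                assign[abs(free[0])] = free[0] > 0
                changed = True
    return assign

def dpll(clauses, assign=None, saved=None, negative=True):
    """Toy DPLL search. With negative=True the "false" branch is always
    explored first (negative branching); otherwise the branch order
    follows the phases recorded in `saved` (mimicking phase saving)."""
    assign = dict(assign or {})
    if unit_propagate(clauses, assign) is None:
        return None
    variables = {abs(l) for clause in clauses for l in clause}
    free = sorted(variables - assign.keys())
    if not free:
        return assign                         # every variable assigned: model
    v = free[0]
    first = False if negative else (saved or {}).get(v, False)
    for val in (first, not first):
        model = dpll(clauses, {**assign, v: val}, saved, negative)
        if model is not None:
            return model
    return None

assert dpll([[1, 2], [-1, 3], [-3]]) == {1: False, 2: True, 3: False}
assert dpll([[1], [-1]]) is None
```

In a real solver the branching order interacts with clause learning and restarts, so this sketch only conveys the flavor of the heuristic, not its practical impact.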

#### 4.1 Convergence of *W*

With the mixed binary–ternary scheme, the function W from Section 3.1 can be seen to be simulated by the system W = {f → 0} ∪ X. A small matrix interpretations proof is found for this system in less than a second, in contrast to its unary variant, for which no automated proof is known.

Theorem 4. SN(W)*.*

*Proof.* The interpretations below prove SN({f → 0}/X^rev):

$$[\vec{x}](\vec{x}) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \vec{x} + \begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad [\varepsilon](\vec{x}) = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

$$[\circ](\vec{x}) = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \vec{x} \qquad [\circ](\vec{x}) = \begin{bmatrix} 1 & 2 \\ 0 & 0 \end{bmatrix} \vec{x}$$

$$[\circ](\vec{x}) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \vec{x} + \begin{bmatrix} 2 \\ 0 \end{bmatrix} \qquad [\circ](\vec{x}) = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 2 \\ 2 \end{bmatrix} \qquad [\mathbb{Z}](\vec{x}) = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 2 \\ 2 \end{bmatrix}$$

By Lemmas 3 and 2, X^rev is terminating. As a result, W^rev is terminating, which by Lemma 2 implies that W is terminating.

#### 4.2 Farkas' Variant

Let 2N + 1 = {1, 3, 5, ...} denote the odd natural numbers. Farkas [8] studied a slight modification F : 2N + 1 → 2N + 1 of the Collatz function that can be proved convergent via induction. We consider automatically proving the convergence of this function as another test case for the mixed base scheme that is easier than the Collatz conjecture without being entirely trivial. We refer the reader to [8] for the original definition of F. Below, we define another function F : N → N that resembles the Collatz function more closely than Farkas' F (with respect to the definitions of the cases) while being equivalent to F in terms of convergence. This variant is obtained by introducing an additional case in the Collatz function for n ≡ 1 (mod 3) and applying T otherwise. Its definition and a set D_F of dynamic rules are shown below.

$$F(n) = \begin{cases} \frac{n-1}{3} & \text{if } n \equiv 1 \pmod{3} \\ \frac{n}{2} & \text{if } n \equiv 0 \text{ or } n \equiv 2 \pmod{6} \\ \frac{3n+1}{2} & \text{if } n \equiv 3 \text{ or } n \equiv 5 \pmod{6} \end{cases} \qquad \mathcal{D}_F = \left\{ \begin{array}{ll} \mathtt{1}\lhd \to \lhd & \mathtt{0t}\lhd \to \mathtt{f}\lhd \\ \mathtt{0f}\lhd \to \mathtt{0}\lhd & \mathtt{1t}\lhd \to \mathtt{12}\lhd \\ \mathtt{1f}\lhd \to \mathtt{1}\lhd & \mathtt{2t}\lhd \to \mathtt{22}\lhd \\ \mathtt{2f}\lhd \to \mathtt{t}\lhd & \end{array} \right\}$$

Termination of the rewriting system F = D_F ∪ X is equivalent to the convergence of F. The proof of the equivalence is essentially the same as that of Theorem 3. Farkas gave an inductive proof of convergence for F via case analysis, and we found an automated proof that F is terminating via arctic interpretations. It is worth mentioning that the default configurations of the existing termination tools (e.g., AProVE, Matchbox) are too conservative to prove termination of this system, but after their authors tweaked the strategies they were also able to find automated proofs via arctic interpretations.
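The definition of F can be sanity-checked by brute force. The sketch below is ours and assumes, as in the text, that convergence means the trajectory reaches 1:

```python
def F(n):
    """The variant of Farkas' function defined above."""
    if n % 3 == 1:
        return (n - 1) // 3
    if n % 6 in (0, 2):
        return n // 2
    return (3 * n + 1) // 2   # n = 3 or 5 (mod 6)

def converges(n, limit=10_000):
    """Does the F-trajectory of n reach 1 within `limit` steps?"""
    for _ in range(limit):
        if n == 1:
            return True
        n = F(n)
    return False

# Convergence (provable by induction per Farkas) holds for small inputs:
assert all(converges(n) for n in range(2, 1000))
```

A brute-force check of course proves nothing about all n; it merely confirms that the case analysis above is transcribed consistently.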

Theorem 5. *For all n ∈ N⁺, the trajectory of F starting at n contains 1.*

*Proof.* We will show SN(F). By Lemmas 3 and 2, we have SN(X^rev). The arctic interpretations below (with the empty cells standing for −∞) prove SN((D_F^rev)_top/X^rev) by Theorem 1, which implies SN(D_F^rev/X^rev) by Lemma 4. As we know that X^rev is terminating, by Lemma 1 we conclude SN(D_F^rev ∪ X^rev), implying SN(F) via Lemma 2.

$$\begin{aligned} [\vec{x}](\vec{x}) &= \begin{bmatrix} 2 \\ 2 \\ 2 \\ -1 \end{bmatrix} \vec{x} + \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \qquad [t](\vec{x}) = \begin{bmatrix} 0 & 2 & 0 \\ 0 & 2 & 0 \\ 2 & 2 & 0 \\ & & 1 \end{bmatrix} \vec{x} + \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \\ [\circ](\vec{x}) &= \begin{bmatrix} 0 & 4 & 0 \\ 2 & 0 & 0 \\ 4 & 0 & 1 \\ 0 & 3 & 0 \end{bmatrix} \vec{x} \qquad [\circ](\vec{x}) = \begin{bmatrix} 0 & 0 \\ 0 & 4 \\ 4 & 0 \\ 0 & 3 \end{bmatrix} \vec{x} \qquad [\vec{z}](\vec{x}) = \begin{bmatrix} 0 & 0 \\ 4 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \vec{x} \end{aligned}$$

#### 4.3 Subsets of *T*

It is also interesting to consider whether we can automatically prove termination of proper subsets of T. Specifically, we considered the 11 subsystems obtained by leaving out a single rewriting rule from T, and we found proofs via matrix/arctic interpretations for all 11 subproblems. The reason for our interest in these problems is threefold:


*Example 5.* As an instance of leaving out a rule, consider the subsystem T \{f1 → 0t}. There is a single-step matrix interpretations proof that this system is terminating:

$$[\vec{\pi}](\vec{x}) = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \vec{x} \qquad [\text{t}](\vec{x}) = \begin{bmatrix} 1 & 3 \\ 3 & 4 \end{bmatrix} \vec{x} + \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

$$[\circ](\vec{x}) = \begin{bmatrix} 1 & 5 \\ 0 & 0 \end{bmatrix} \vec{x} \qquad [\circ](\vec{x}) = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix} \vec{x} + \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

$$[\circ](\vec{x}) = \begin{bmatrix} 7 & 2 \\ 2 & 5 \end{bmatrix} \vec{x} + \begin{bmatrix} 2 \\ 1 \end{bmatrix} \qquad [1](\vec{x}) = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \vec{x} + \begin{bmatrix} 1 \\ 0 \end{bmatrix} \qquad [2](\vec{x}) = \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix} \vec{x} + \begin{bmatrix} 0 \\ 2 \end{bmatrix}$$

With the above interpretations, we can show for instance that the Collatz trajectory starting at 3 (represented as t) is convergent, because the missing rule is not used in any derivation of 1 (the empty string) from 3. Below is an example derivation along with the decimal values that the strings represent. We omit the subscripts from the rewrite relations and simply write →.

$$\begin{array}{*{17}{c}}
3 && 5 && 5 && 8 && 8 && 8 && 4 && 2 && 1 \\
\mathtt{t} & \to & \mathtt{2} & \to & \mathtt{ft} & \to & \mathtt{f2} & \to & \mathtt{1f} & \to & \mathtt{fff} & \to & \mathtt{ff} & \to & \mathtt{f} & \to & \varepsilon
\end{array}$$

Table 1 shows the parameters for the proofs that we found for the termination of each subsystem. For each rule ℓ → r that is left out, we searched for a stepwise proof to show that B \ {ℓ → r} is terminating relative to T \ {ℓ → r} (freely utilizing weakly monotone algebras due to Lemma 4). Such a proof requires at most three steps, since there are at most three rules in B \ {ℓ → r}. In the table, we report the smallest parameters (in terms of matrix dimension) that work for all of these steps. As we already know that SN(T \ B) holds (by Lemma 3), the interpretations found allow us to conclude the termination of each subsystem. This is not the only way to prove termination of the subsystems; however, we chose this uniform strategy for the sake of comparison.

Table 1. Smallest proofs found for termination of subsystems of T in under 120 seconds. The columns show the matrix dimension d and the maximum number v of distinct coefficients that appear in the matrices, along with the median time to find an entire termination proof across 10 repetitions for the fixed d and v.

#### 4.4 Odd Trajectories

In the originally defined Collatz function C, applying the odd case 2n + 1 → 6n + 4 always produces an even number, so we incorporate a single division by 2 into the definition of the odd case and obtain the function T with the same overall dynamics as C. Taking this idea further by performing as many divisions by 2 as possible leads to the so-called Syracuse function Syr : 2N + 1 → 2N + 1, defined as Syr(n) = (3n + 1)/2^k, where k = max{j ∈ N⁺ | 2^j divides 3n + 1}.

Expressing the Syracuse function as a generalized Collatz function would require infinitely many cases to account for all possible appearances of 2^k in the denominator for different values of k. As a result, we are unable to simulate it with a finite rewriting system. Nevertheless, we may compromise and accelerate the Collatz function by a constant amount. We first observe that if n ≡ 1 (mod 8) then Syr(n) = (3n + 1)/4, and if n ≡ 3 (mod 4) then Syr(n) = (3n + 1)/2. Furthermore, for any n ∈ N we have Syr(8n + 5) = Syr(2n + 1), since 3(8n + 5) + 1 = 24n + 16 = 4(6n + 4) = 4(3(2n + 1) + 1). Putting these observations together, we can define a generalized Collatz function S : 2N + 1 → 2N + 1 as follows.

$$S(n) = \begin{cases} \frac{3n+1}{4} & \text{if } n \equiv 1 \pmod{8} \\\frac{n-1}{4} & \text{if } n \equiv 5 \pmod{8} \\\frac{3n+1}{2} & \text{if } n \equiv 3 \pmod{4} \end{cases}$$

S is convergent if and only if C (or T) is convergent, and the number of steps that S takes to converge is between that of T and Syr. In a manner similar to before, we

Fig. 1. Transition graphs of the iterates in the Collatz trajectories across residue classes modulo 8 for the functions C (left), T (middle), and S (right). For each function f, the edge u → v is part of its transition graph if and only if there exists some n ≡ u (mod 8) such that f(n) ≡ v (mod 8). Bold edges indicate transitions where f(n) > n.

can translate S into a rewriting system S = {ff• → 0•, tf• → •, t• → 2•} ∪ X. Since we are working with odd numbers, we use a new symbol • to mark the end of a string, viewed functionally as •(x) = 2x + 1. Termination of the rewriting system S is equivalent to the convergence of S. As with T, proving the termination of S is currently beyond our reach, although it may be an easier path to the Collatz conjecture (compared to proving SN(T)). Failing to prove the termination of S itself, we considered the subsystems of S as we did for T in Section 4.3. With matrix/arctic interpretations, the terminations of all but two of the 11-rule subsystems of S were automatically proved. Despite devoting thousands of CPU hours, we were not able to find interpretations proving that S_1 = S \ {ff• → 0•} or S_2 = S \ {tf• → •} is terminating, so we leave them as challenges for automated termination proving.
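The arithmetic observations behind S are easy to verify computationally. The following sketch (helper names ours) checks the identity Syr(8n + 5) = Syr(2n + 1) and that S maps odd numbers to odd numbers:

```python
def T(n):
    return n // 2 if n % 2 == 0 else (3 * n + 1) // 2

def Syr(n):
    """Syracuse function on odd n: divide 3n + 1 by 2 as often as possible."""
    assert n % 2 == 1
    m = 3 * n + 1
    while m % 2 == 0:
        m //= 2
    return m

def S(n):
    """The accelerated function S defined above, on odd n."""
    assert n % 2 == 1
    if n % 8 == 1:
        return (3 * n + 1) // 4
    if n % 8 == 5:
        return (n - 1) // 4
    return (3 * n + 1) // 2   # n = 3 (mod 4)

# The observation used to define S:
assert all(Syr(8 * n + 5) == Syr(2 * n + 1) for n in range(1000))
# S maps odd numbers to odd numbers, so it can be iterated:
assert all(S(n) % 2 == 1 for n in range(1, 2001, 2))
```

The two non-identity cases of S agree with Syr on their residue classes, while the n ≡ 5 (mod 8) case merely shrinks the argument without leaving the Syr-orbit.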

#### 4.5 Collatz Trajectories Modulo 8

Let m be a power of 2. Given k ∈ {0, 1, ..., m − 1}, is it the case that all nonconvergent Collatz trajectories contain some n ≡ k (mod m)? For several values of k this can be proved to hold by inspecting the transitions of the iterates in the Collatz trajectories across residue classes modulo m (shown in Figure 1 for m = 8). These questions can also be formulated as the termination of certain rewriting systems. With this approach we found automated proofs for several cases:
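The transition graphs of Figure 1 can be recomputed exactly; for T, the residue T(n) mod 8 depends only on n mod 16, so a few representatives per class suffice. The script below is our own sketch:

```python
def T(n):
    return n // 2 if n % 2 == 0 else (3 * n + 1) // 2

def transitions_mod8(f, samples=4):
    """Edges u -> v with v = f(n) mod 8 for some n = u (mod 8), n > 0.
    For T, f(n) mod 8 depends only on n mod 16, so two representatives
    per class would already suffice; we sample a few more for safety."""
    edges = set()
    for u in range(8):
        for k in range(samples):
            n = u + 8 * k
            if n > 0:
                edges.add((u, f(n) % 8))
    return edges

graph = transitions_mod8(T)
assert (4, 2) in graph and (4, 6) in graph   # T(4) = 2, T(12) = 6
assert (1, 2) in graph and (1, 6) in graph   # T(1) = 2, T(9) = 14
```

Enumerating the complement of the out-neighborhoods in this graph is what makes residue-avoidance arguments like Theorem 6 plausible before attempting a rewriting proof.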

Theorem 6. *If there exists a nonconvergent Collatz trajectory, it cannot avoid the residue classes of* 2*,* 3*,* 4*,* 6 *modulo* 8*.*

It remains open whether the above holds for the residue classes of 0, 1, 5, 7 modulo 8.

## 5 More Problems to Approach via Rewriting

*Mahler's 3/2 Problem.* Let ξ ∈ R_{>0} be a real number. It is called a *Z-number* if for all k ∈ N we have frac(ξ(3/2)^k) < 1/2, where frac(·) denotes the fractional part. Mahler [20] conjectured that there are no Z-numbers. Moreover, he considered a generalized Collatz function M : N⁺ → N⁺, defined as follows.

$$M(n) = \begin{cases} \frac{3n}{2} & \text{if } n \equiv 0 \pmod{2} \\\frac{3n+1}{2} & \text{if } n \equiv 1 \pmod{4} \\\bot & \text{if } n \equiv 3 \pmod{4} \end{cases}$$

He related the behaviors of M-trajectories to the existence of Z-numbers:

Theorem 7. *For n ∈ N⁺, if a Z-number exists in the interval* [n, n + 1)*, then there is no* k ∈ N *for which* M^k(n) ≡ 3 (mod 4)*.*

Thus, the nonexistence of Z-numbers can be established by proving that M is convergent, which is equivalent to the termination of M = {f → 0, ft → 1f} ∪ X. In order to ensure termination at the case n ≡ 3 (mod 4), there is no rule with the left-hand side tt.

*Halting Problem for Busy Beaver-5.* The busy beaver problem concerns finding binary-alphabet Turing machines with n states that, when given an input tape of all 0s, write the largest number of 1s on the tape upon halting. For each n, the machine that achieves this is called the "Busy Beaver-n". Note that this definition only requires the machines to halt on all-0 inputs, leaving the behavior on other inputs unspecified and allowing them not to halt in general. Michel [21] observed that for n ∈ {2, 3, 4}, the busy beaver machines are all *total Turing machines*, i.e., they halt on all inputs, and moreover proved that they all simulate some generalized Collatz function. It is an open problem whether all busy beavers are total. In particular, it is unknown whether the current Busy Beaver-5 candidate is total. Michel showed that the Busy Beaver-5 candidate simulates the following generalized Collatz function.

$$B(n) = \begin{cases} \frac{5n+18}{3} & \text{if } n \equiv 0 \pmod{3} \\\frac{5n+22}{3} & \text{if } n \equiv 1 \pmod{3} \\\bot & \text{if } n \equiv 2 \pmod{3} \end{cases}$$

Convergence of the above function can be studied via the termination of a rewriting system obtained by a mixed {3, 5}-ary (ternary–quinary) translation scheme. We were unable to prove the termination of the resulting system.

*Ternary Expansions of 2^n.* Erdős [7] asked: When does the ternary expansion of 2^n omit the digit 2? This is the case for 2^0 = (1)₃, 2^2 = (11)₃, and 2^8 = (100111)₃. He conjectured that it does not happen for n > 8. This conjecture can be proved by showing that the rewriting system E = {0◁ → ◁, 1◁ → ◁, ▷◁ → ▷◁} ∪ {r → ℓ | ℓ → r ∈ X} is terminating on all initial strings of the form f^8f^+. Given a string that corresponds to the binary representation of a power of 2, this system essentially rewrites the string into ternary by pushing ternary symbols to the right without altering the value that the string represents, and removes the occurrences of the ternary digits 0 and 1 (but not 2) at the right end. If the ternary expansion does not contain the digit 2, then all digits will be removed, resulting in the string ▷◁, which can then be rewritten to itself indefinitely. This problem, as described, is an instance of "local termination" [28], since it is concerned with termination not on all possible strings but on a subset of them. We have not performed experiments with this system or with local termination yet, and we leave this for future work.
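Erdős's observation is easy to reproduce for small exponents; the sketch below (ours) lists the exponents n for which 2^n omits the ternary digit 2:

```python
def ternary_digits(m):
    """Ternary digits of m, most significant first."""
    digits = []
    while m:
        digits.append(m % 3)
        m //= 3
    return digits[::-1] or [0]

# Exponents n < 60 for which the ternary expansion of 2^n omits the digit 2:
omit_two = [n for n in range(60) if 2 not in ternary_digits(2 ** n)]
assert omit_two == [0, 2, 8]   # 1 = (1)_3, 4 = (11)_3, 256 = (100111)_3
```

This of course only confirms the conjecture on an initial segment; the rewriting formulation above is what would turn it into a (local) termination proof.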

## 6 Related Work

To our knowledge, Zantema [29], with his system Z that we saw in Section 3.1, was the first to attempt using an automated method and string rewriting to search for a proof of the Collatz conjecture. In addition, although we independently discovered the mixed binary–ternary system described in Section 3.2, Scollo [25] had essentially the same idea, the difference being that he adopted a functional view of the digits slightly different from the one in (1). Scollo was not concerned with proving termination, though; he proposed rewriting primarily as a formalism that forgoes the arithmetic interpretation of the iterates and instead emphasizes their dynamic/computational behavior.

De Mol [4] showed the existence of a small 2-tag system [23] with the rules {a → bc, b → a, c → aaa} that simulates the iterated application of the Collatz function on a unary representation. This tag system halts if and only if the Collatz conjecture holds, giving yet another formulation of the problem.
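De Mol's 2-tag system can be simulated in a few lines. In the sketch below (ours), a word a^n is rewritten, two symbols at a time, until it is again a block of a's, which, per de Mol's result, has length T(n) for the inputs tested here:

```python
PROD = {"a": "bc", "b": "a", "c": "aaa"}

def next_collatz_word(n):
    """Run de Mol's 2-tag system from a^n until the word is again of the
    form a^m (or too short to continue); return m."""
    w = "a" * n
    while True:
        if len(w) < 2:
            return len(w)
        w = w[2:] + PROD[w[0]]   # delete two symbols, append the production
        if set(w) == {"a"}:
            return len(w)

def T(n):
    return n // 2 if n % 2 == 0 else (3 * n + 1) // 2

# One "macro step" of the tag system computes one step of T:
assert all(next_collatz_word(n) == T(n) for n in range(2, 30))
```

Iterating `next_collatz_word` thus replays a Collatz trajectory in unary, which is why halting of the tag system on every a^n is equivalent to the conjecture.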

Kari [11] designed 1D cellular automata that perform multiplication by 3 and 3/2 in base 6, and reformulated both the Collatz conjecture and Mahler's 3/2 problem as sets of constraints to be satisfied by the space-time diagrams of these cellular automata.

Kauffman [13] developed a formalism to perform arithmetic that he called *string arithmetic*, and expressed the Collatz conjecture within it. This formalism works with unary representations of numbers, and uses the three symbols 1, , . Letting denote the empty string and N be any string representing a number, string arithmetic consists of the following bidirectional rewrite rules (or "identities") to convert between different strings representing the same number: { ←→ , <sup>11</sup> ←→ 1, 1N ←→ N1}. Then, the Collatz function is encoded by the following two rules: {N <sup>→</sup> <sup>N</sup>, N<sup>1</sup> <sup>→</sup> N1N}. The Collatz conjecture is equivalent to the question of whether for strings of 1s of all lengths there exists a rewrite sequence using the five rules above to reach the string 1.

## 7 Future Work

Several extensions of this work could further our understanding of the potential of rewriting techniques for answering mathematical questions. For instance, although matrix/arctic interpretations lead to automated proofs of several weakened variants discussed in this paper, it might still be the case that no matrix/arctic interpretation exists that establishes the termination of the Collatz system T. Proving such nonexistence would provide guidance as to where to focus our efforts when searching for a proof. Another issue is the matter of representation: specifically, it is worth exploring whether there exists a suitable translation of the Collatz conjecture into a term, instead of string, rewriting system, since many automated termination proving techniques generalize to term rewriting. Finally, injecting problem-specific knowledge into the rewriting systems or the termination techniques would be helpful, as there exists a wealth of information about the Collatz conjecture that could simplify proof search.

*Acknowledgments.* We thank Jeffrey Lagarias, Florian Frohn, Johannes Waldmann, Carsten Fuhs, Jürgen Giesl, Luke Schaeffer, and Chris Lynch for discussions. We thank Jeremy Avigad, Jasmin Blanchette, and reviewers of CADE for their detailed comments on an earlier draft. This work was supported by NSF under grant CCF-2006363.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Verified Interactive Computation of Definite Integrals**

Runqing Xu<sup>1,2</sup>, Liming Li<sup>1</sup>, Bohua Zhan<sup>1,2</sup>(✉)

<sup>1</sup>SKLCS, Institute of Software, Chinese Academy of Sciences, Beijing, China <sup>2</sup>University of Chinese Academy of Sciences, Beijing, China {xurq, lilm, bzhan}@ios.ac.cn

**Abstract.** Symbolic computation is involved in many areas of mathematics, as well as in analysis of physical systems in science and engineering. Computer algebra systems present an easy-to-use interface for performing these calculations, but do not provide strong guarantees of correctness. In contrast, interactive theorem proving provides much stronger guarantees of correctness, but requires more time and expertise. In this paper, we propose a general framework for combining these two methods, and demonstrate it using computation of definite integrals. It allows the user to carry out step-by-step computations in a familiar user interface, while also verifying the computation by translating it to proofs in higher-order logic. The system consists of an intermediate language for recording computations, proof automation for simplification and inequality checking, and heuristic integration methods. A prototype is implemented in Python based on HolPy, and tested on a large collection of examples at the undergraduate level.

**Keywords:** Symbolic integration, User interface, Proof automation

## **1 Introduction**

Symbolic computation is an important tool in mathematics, science, and engineering. It forms a key part of many mathematical proofs. On the engineering side, justifications for the design of signal processing and control systems contain extensive symbolic computations [6,33], involving derivatives and integrals, Laplace and Fourier transforms, and various special functions.

Typically, these computations can be performed using computer algebra systems such as Mathematica, Maple, and Maxima. Given the complexity of the task, it is not surprising that even the best of these systems are liable to errors. One famous example is $\int_{-1}^{1} \sqrt{x^2}\,dx$, which an early version of Maple evaluated to zero [23] (the error has been fixed in more recent versions). Bugs in Mathematica have also been observed by mathematicians [15], including evaluation of determinants of matrices with large integer entries, and several evaluations of integrals (also fixed in the most recent version). While some errors are simply implementation mistakes, more systematic errors in symbolic computation

may arise due to neglect of checking side conditions, involving concepts such as well-definedness of expressions, singularities, convergence, and so on. While individual bugs can be reported and fixed, completely eliminating the possibility of error would require a more systematic approach.
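The correct value of the Maple example above is 1, since the integrand is |x|; even a crude numeric check exposes that class of bug (the sketch below is ours):

```python
def midpoint_integral(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the definite integral of f on [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# int_{-1}^{1} sqrt(x^2) dx = int_{-1}^{1} |x| dx = 1, not 0:
approx = midpoint_integral(lambda x: (x * x) ** 0.5, -1.0, 1.0)
assert abs(approx - 1.0) < 1e-6
```

Such spot checks can catch gross evaluation errors, but they offer none of the systematic guarantees that the verified framework described below aims for.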

Formalization of mathematics in interactive theorem provers promises to eventually achieve this goal. There is already a lot of work on formalization of analysis and linear algebra in interactive theorem provers, as well as verified computations based on the formalized theories. They provide much stronger guarantees of correctness, and also allow users to specify more detailed steps, enabling computations that are too difficult to be found automatically by computer algebra systems. However, a major disadvantage (for now) is that interactive theorem proving requires a great deal of time and expertise on the part of the user, making it difficult to apply on a much larger scale.

It is therefore natural to try to combine the advantages of computer algebra systems with theorem proving. There have already been many works in this direction. A common approach, proposed by Harrison and Théry [20,23], is to invoke a computer algebra system for computations that are difficult to perform but whose results can be verified more easily. This greatly extends the capability of proof assistants for tasks such as factorization [23], linear arithmetic [28], etc. However, to use such a system, the user still needs expertise in the use of proof assistants, and the range of applicability is limited by the simple proof automation that is available for checking results.

In this paper, we propose a more general framework for verified symbolic computation in theorem provers, and demonstrate it using computation of definite integrals. The resulting system allows users to perform calculations of definite integrals step-by-step, in a user interface similar to that of a computer algebra system, but with the computations verified by automatic translation to proofs in higher-order logic. We choose definite integration for demonstration purposes, due to the great variety of techniques that can be used, but we intend the idea to be applicable to other kinds of symbolic computations.

The framework consists of several components. At the top, a graphical user interface displays the current computation and allows user actions. The user interface produces computations in a standard format. Next, proof automation is used to reconstruct from the computation a proof in higher-order logic. Finally, the proof depends on theorems in mathematics, e.g. (in the case of definite integration) those concerning continuity, derivatives, and integrals.

We implement a prototype based on HolPy, a new interactive theorem prover written in Python [49]. The SymPy package for symbolic computation in Python is used at various places for untrusted computations. The user interface is written in JavaScript as a web application, using Python as backend for convenient invocation of HolPy and SymPy libraries. The underlying theorems in analysis are mostly translated to HolPy from HOL Light (with some modifications). Their proofs have not been fully formalized in HolPy, hence the statements of these theorems still need to be trusted.

We now give an outline for the rest of this paper. Section 2 presents the overall framework. Section 3 describes the intermediate format for recording computations of definite integrals. In Sections 4 and 5, we describe respectively the user interface and the proof reconstruction process. In Section 6 we present an evaluation of the system, along with some interesting examples. Finally, we conclude in Section 7 with a discussion of possible future work.

*Related work.* There is a huge body of work on formal verification of continuous and hybrid systems, based on reachability checking [4], computation of invariants [36,41], deductive methods [34,35,47], and so on. In particular, KeYmaera X [18] provides a user interface for verifying hybrid systems using differential dynamic logic, with automatic generation of proofs checkable in Isabelle [9]. Most of this work focuses on automatic verification and/or logical formalisms. Our work can be seen as complementary, focusing on verifying symbolic reasoning about mathematical concepts such as special functions and integration, which can also form a part of the justification of control systems.

Harrison and Théry proposed the "skeptical" approach for combining theorem provers with computer algebra systems [20,23]. Some common applications include factorization of polynomials, which is further applied to verify antiderivatives involving sine and cosine [23]. More recently, this technique was used by Chyzak et al. to formalize the proof of irrationality of ζ(3) [14], and by Harrison to verify proofs of hypergeometric sums found using the WZ method [22]. Similar approaches are implemented in Isabelle [8], PVS [3] and Lean [28]. Compared to this work, we present more complex proof automation for reconstructing proofs, as well as a user interface that allows users to perform multi-step computations in a more familiar setting. Other user interfaces for proof assistants with support for displaying mathematical computations include Theorema [11] and jsCoq [5].

The theory of integration has been formalized in every major proof assistant [12,24,31,40,43]. Recently, more advanced concepts that are important in science and engineering have been formalized, including the work by Hasan et al. on Fourier and Laplace transforms [37,38,46], and Immler et al. on ordinary differential equations [25,26]. Work has also been done on formalizing advanced concepts in linear algebra [29], with applications in analyzing mechanical systems [13,44]. Of course, formalized symbolic computation can be applied in many other domains. For example, Selsam et al. [42] verified in Lean the correctness of stochastic backpropagation, an important algorithm in deep learning.

Slagle initiated the study of automatic integration with a heuristic method [45]. Later research focused more on methods that are complete for certain types of integrands, such as Risch's algorithm [19]. More recently, Rubi (rule-based integration) has been demonstrated to be a powerful technique [39]. However, none of this work focuses on formal verification. A verified computation of asymptotics for real-valued functions is implemented by Eberl [16]. Verified *numerical* computation of definite integrals is implemented by Mahboubi et al. [30].

*Acknowledgements.* This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 62002351, 62032024, and the Chinese Academy of Sciences Pioneer 100 Talents Program under Grant No. Y9RC585036.

# **2 Overall Architecture**

In this section, we describe the overall architecture of the system, leaving descriptions of its components to the following sections. We focus on definite integrals of continuous functions in one variable over closed intervals. In particular, we consider expressions given by the following syntax:

$$e \;:=\; v \;\mid\; c \;\mid\; e\_1 \ \mathit{op} \ e\_2 \;\mid\; f(e) \;\mid\; \mathsf{Deriv}(e, v) \;\mid\; \mathsf{Integral}(e, v, a, b)$$

Here $v$ is a variable; $c$ is a constant (either a rational number or $\pi$); $\mathit{op}$ is an arithmetic operation ($+$, $-$, $\times$, $\div$, and exponentiation); $f$ is a special function (such as logarithms, exponentials, or trigonometric functions); $\mathsf{Deriv}(e, v)$ denotes the derivative of $e$ with respect to variable $v$; $\mathsf{Integral}(e, v, a, b)$ denotes the definite integral of $e$ with respect to variable $v$ over the interval $[a, b]$. In the rest of this paper, we will use both concrete syntax and LaTeX form of expressions. We use *locations* to point to particular subexpressions. A location is given by a sequence of natural numbers (written in the form $n_1.n_2.\ldots.n_k$, with each $n_i$ starting from zero), specifying the path to a subtree in the abstract syntax tree of an expression. For example, in the expression

$$1 + \mathsf{Integral}(1 + \sin^3(x), x, 0, 1)$$

the location of $\sin^3(x)$ is given by 1.0.1.
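As an illustration of locations (not HolPy's actual data structures; the `Expr` class and `subexpr_at` helper below are our own), a location is simply a path of child indices in the expression tree:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Expr:
    """A node in the abstract syntax tree: a head symbol and its arguments."""
    head: str
    args: List["Expr"]

def subexpr_at(e: Expr, loc: List[int]) -> Expr:
    """Follow a location (a list of child indices, each starting from 0) to a subtree."""
    for n in loc:
        e = e.args[n]
    return e

# 1 + Integral(1 + sin^3(x), x, 0, 1); here sin^3(x) is represented as (sin x)^3.
x = Expr("x", [])
sin3 = Expr("^", [Expr("sin", [x]), Expr("3", [])])
integral = Expr("Integral",
                [Expr("+", [Expr("1", []), sin3]), x, Expr("0", []), Expr("1", [])])
e = Expr("+", [Expr("1", []), integral])

print(subexpr_at(e, [1, 0, 1]).head)  # the subexpression at location 1.0.1
```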

A computation is represented as a list of steps, with each step specifying a rewriting of the current expression. Each step should provide sufficient information so that both checking its correctness and proof generation can be performed relatively easily. A computation begins with the integral to be evaluated, and ends with an expression in simplified closed form. Each step contains the name of the rule used, the location in the expression at which it is applied, and the expected result of applying the step. A step may contain additional parameters and certificates needed for verification. Rules of integration include substitution, integration by parts, use of a trigonometric identity, and so on (described in detail in Section 3). For example, integration by parts takes as parameters two expressions u and v, such that f · dx = u · dv where f is the integrand of the integral at the given location.
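For concreteness, one step of a computation could be recorded as follows (the field names and concrete syntax are our own illustration, not necessarily the system's exact format):

```python
# A hypothetical record of one computation step: the rule name, the location
# where it is applied, the rule's parameters, and the expected result.
step = {
    "rule": "Integration by parts",
    "location": "0",                      # location of the integral in the expression
    "params": {"u": "x", "v": "exp(x)"},  # u, v such that f * dx = u * dv
    "result": "x * exp(x) - INT x:[-1,2]. exp(x)",  # illustrative concrete syntax
}
```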

A graphical user interface allows the user to specify a computation in ways similar to using a computer algebra system. The user interface displays the computation in LaTeX or in text form. At each step, the user selects part of the current expression to focus on, then selects an action from the menu. Depending on the selected action, the user may need to enter some of the parameters, while the other parameters are automatically inferred by the system. After checking the validity of inputs, the user interface computes the result of the action. A package for symbolic computation may be invoked at this step.

There are many side conditions that need to hold in order for a computation step to be correct, some of which may not be caught at the user interface. Translation of the computation to proofs in higher-order logic greatly increases our confidence in the computation and can point out potential errors. In this work, we translate the computation to higher-order logic proofs in HolPy. One main difficulty is implementing sufficiently powerful proof automation for simplification of expressions, inequality checking, and other side conditions. We demonstrate that the API for proof automation in HolPy is sufficiently powerful for this purpose. However, note that the representation of a computation is independent of any particular proof assistant, so additional proof translations may be implemented for other proof assistants.

Finally, various algorithms for integration (such as Slagle's method [45]) may be implemented to perform several steps of computation at once. We implemented Slagle's method and offer it as one of the options in the user interface.

The overall framework is shown in the following diagram.

Here solid boxes and arrows indicate parts that are implemented for this paper. The analysis library is only partially formalized. Dotted arrows indicate possible future extensions.

This layered design can be viewed as a separation of concerns. At the top, the user only needs to think about how to evaluate an integral in general mathematical terms. The implementation of integration algorithms only involves computer algebra. Proof automation involves algorithms for constructing proofs in the underlying logic. Finally, building a library in analysis involves working with a proof assistant. All of these are put together to enable verification of potentially difficult symbolic integration by producing proofs in higher-order logic or other logical formalisms. In the following three sections, we describe the top three layers of the system in more detail.

# **3 Integration Rules**

Rules of integration define the language for recording computations. Each rule may take additional parameters (as described below), as well as a location parameter specifying the subexpression the rule is applied on.

#### **3.1 Simplification**

The rule **Simplification** rewrites an expression to an equivalent simpler form. The details of simplification depend on the implementation. Here we only specify in broad terms what is and is not simplified. These choices are made mainly considering the ease of performing simplifications, and having a clearly defined "simplified form". We do expand products of polynomials and combine terms (e.g. from $(x+1)(x-1)$ to $x^2-1$). We do not reduce quotients of polynomials (e.g. from $(x^3+1)/(x^2+1)$ to $x-(x-1)/(x^2+1)$, or from $2/(x^2-1)$ to $1/(x-1)-1/(x+1)$). We do not automatically expand powers (e.g. $(x+1)^5$). We do simplify values of trigonometric functions (e.g. from $\sin(\frac{\pi}{4})$ to $\sqrt{2}/2$, and from $\sin(\frac{\pi}{2}-x)$ to $\cos x$), but do not use other trigonometric identities. We do evaluate derivatives and apply a fixed list of basic integrals, including linearity, powers, sine, cosine, exponential, and derivatives of trigonometric functions.

One complication is that certain rewrite rules contain side conditions. For example, it is only possible to simplify $\sqrt{xy}$ to $\sqrt{x}\cdot\sqrt{y}$ when both $x$ and $y$ are nonnegative. Likewise $(x^2)^{1/2}$ can be simplified to $x^{2\cdot\frac{1}{2}} = x$ only if $x$ is nonnegative (otherwise the mistake mentioned in the introduction would result). When simplifying an integrand of an integral in $x$, we assume that $x$ is within the open domain of integration, and perform simplification only if it is allowed by this assumption.

#### **3.2 Trigonometric Identities**

Application of trigonometric identities can be very tricky. It is often necessary to use trigonometric identities to rewrite an expression to a more complex form, in order to prepare for a substitution or integration by parts.

We use the classification of trigonometric identities by Fu et al. [17], which is implemented in SymPy (sympy.simplify.fu). In this scheme, trigonometric identities are classified into several groups with names of the form TRi. Some commonly used groups are shown below (rewriting from left to right):


The **Rewrite trigonometric** rule rewrites using one group of trigonometric identities, followed by simplification. It takes a parameter *rule* which specifies the name of the rule used. For example, applying with *rule* = TR5 on $2 - 2\sin^2 x$ yields $2\cos^2 x$.
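These rule groups can be invoked directly from SymPy's `sympy.simplify.fu` module; for instance, the TR5 example above:

```python
import sympy as sp
from sympy.simplify.fu import TR5   # Fu et al.'s TR5: sin(x)**2 -> 1 - cos(x)**2

x = sp.Symbol('x')
rewritten = TR5(2 - 2*sp.sin(x)**2)
# Followed by simplification (here: expansion), this yields 2*cos(x)**2.
print(sp.expand(rewritten))  # 2*cos(x)**2
```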

#### **3.3 Substitution**

Substitution makes use of the following theorem known from first-year calculus:

$$\int\_{a}^{b} f(g(x))g'(x) \, dx = \int\_{g(a)}^{g(b)} f(u) \, du.$$

There are two possible directions for applying the theorem, corresponding to two rules **Substitution I** and **Substitution II**.

**Forward substitution.** The rule **Substitution I** assumes the integrand is of the form $f(g(x))g'(x)$. Typically in informal writing, only $g(x)$ is provided, and $f(x)$ is found by a sometimes magical process. To see the possible complexity involved, consider the integral

$$\int\_{\frac{3}{4}}^{1} \frac{1}{\sqrt{1-x}-1} \, dx$$

The required substitution is $u = \sqrt{1-x}$. The usual explanation continues as follows. Compute $du = -\frac{1}{2}(1-x)^{-1/2}\,dx = -\frac{1}{2}u^{-1}\,dx$. So $dx = -2u \cdot du$. The values of $u$ at the boundary points are $\frac{1}{2}$ and $0$. So the integral can be rewritten as $\int_{1/2}^{0} -2u/(u-1)\,du = \int_{0}^{1/2} 2u/(u-1)\,du$.

Heuristic methods are needed for finding a suitable function $f$. Hence, we require the **Substitution I** rule to specify both $f$ and $g$ as parameters. The rule checks that $f(g(x))g'(x)$ and the original integrand become the same after simplification. We also restrict $g$ to be monotonic (equivalently $g'(x) \ge 0$ or $g'(x) \le 0$ in the open interval $(a, b)$)<sup>1</sup>. For example, the previous substitution is given by $f(u) = 2u/(u-1)$ and $g(x) = \sqrt{1-x}$.
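The equality and monotonicity checks can be sketched in SymPy (a toy `check_substitution` of our own, on a neutral example; the actual rule uses the system's own simplifier and inequality checker):

```python
import sympy as sp

def check_substitution(integrand, f, g, u, x, a, b):
    """Check that f(g(x)) * g'(x) simplifies to the integrand, and that
    g is monotonic on (a, b), i.e. g' >= 0 or g' <= 0 on the open interval."""
    dg = sp.diff(g, x)
    ok_eq = sp.simplify(f.subs(u, g) * dg - integrand) == 0
    interval = sp.Interval.open(a, b)
    # g is monotonic if g' never becomes negative, or never becomes positive, on (a, b).
    ok_mono = (sp.solveset(dg < 0, x, sp.S.Reals).intersect(interval) == sp.S.EmptySet
               or sp.solveset(dg > 0, x, sp.S.Reals).intersect(interval) == sp.S.EmptySet)
    return ok_eq and ok_mono

x, u = sp.symbols('x u')
# u = x**2 turns the integrand 2*x*exp(x**2) into exp(u); g is increasing on (0, 1).
print(check_substitution(2*x*sp.exp(x**2), sp.exp(u), x**2, u, x, 0, 1))
```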

**Backward substitution.** The rule **Substitution II** applies substitution in the other direction. In informal writing, it is usually expressed as substituting $x$ by some expression $g(t)$. Then $f$ is the original integrand, but the values of $a$ and $b$ need to be found by the reader. Our rule requires specifying $a$ and $b$ so that $g(a)$ and $g(b)$ equal the original limits of integration, and $g$ is monotonic on the interval $(a, b)$. For example, the step


is represented as g = sin(t), a = 0 and b = π/2.

#### **3.4 Integration by Parts**

The **Integration by parts** rule applies the theorem

$$\int\_{a}^{b} u(x)v'(x) \, dx = u(x)v(x)|\_{a}^{b} - \int\_{a}^{b} u'(x)v(x) \, dx$$

Typically in informal writing, both u and v are provided. These are recorded as parameters of the rule. The rule checks that f · dx = u · dv, where f is the original integrand. For example, the step

$$\int\_{-1}^{2} xe^x \, dx = xe^x \vert\_{-1}^2 - \int\_{-1}^{2} e^x \, dx$$

is represented as $u = x$ and $v = e^x$.
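The side condition $f \cdot dx = u \cdot dv$ amounts to verifying that the integrand equals $u(x) \cdot v'(x)$ after simplification; sketched with SymPy on the same example:

```python
import sympy as sp

x = sp.Symbol('x')
integrand = x * sp.exp(x)    # f, the integrand of the integral of x*e^x
u, v = x, sp.exp(x)          # parameters of the Integration by parts rule

# f * dx = u * dv means the integrand must equal u * v'.
assert sp.simplify(integrand - u * sp.diff(v, x)) == 0
print("integration by parts parameters check out")
```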

<sup>1</sup> It is possible to relax this assumption, but the process for reconstructing the proof would be more involved.

#### **3.5 Rewriting**

The **Rewrite** rule provides more flexibility for rewriting than simplification. It allows rewriting an expression to any equivalent form as preparation for applying other rules. The rule takes a parameter *rhs* specifying the intended right side of the rewrite, and another expression *denom*, defaulting to 1. The rule checks that *denom* is nonzero in the domain of integration, and that the original expression and *rhs* have the same simplification after multiplying by *denom*.

The presence of *denom* means polynomial division and partial fraction decomposition can be specified. For example, when integrating $x^3/(x^2+1)$, the first step is to divide the numerator by the denominator, yielding $x - x/(x^2+1)$. Simplification as we have implemented it is not strong enough to show their equivalence. However, after multiplying both sides by *denom* $= x^2+1$, the expressions $x^3$ and $x(x^2+1) - x$ become the same after simplification.
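The check performed by **Rewrite** can be sketched as follows (a toy `check_rewrite` of our own, with SymPy standing in for the system's simplifier; checking that *denom* is nonzero on the domain is a separate obligation):

```python
import sympy as sp

def check_rewrite(orig, rhs, denom=1):
    """Accept the rewrite if orig * denom and rhs * denom agree after
    cancellation and expansion."""
    lhs_nf = sp.expand(sp.cancel(orig * denom))
    rhs_nf = sp.expand(sp.cancel(rhs * denom))
    return sp.simplify(lhs_nf - rhs_nf) == 0

x = sp.Symbol('x')
# Polynomial division step from the text: x**3/(x**2+1) rewritten to x - x/(x**2+1).
print(check_rewrite(x**3 / (x**2 + 1), x - x/(x**2 + 1), denom=x**2 + 1))
```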

#### **3.6 Splitting an Integral**

Sometimes it is necessary to split the domain of integration into two or more parts. This is needed to deal with absolute values, and with non-monotonic functions $g$ in a substitution. The rule **Split region** takes a parameter $c$ satisfying $a \le c \le b$, and splits the integral $\int_a^b f(x)\,dx$ into $\int_a^c f(x)\,dx + \int_c^b f(x)\,dx$. For example, when integrating $\int_{-1}^{1}\sqrt{x^2}\,dx$ (the example from the introduction), the first step is to split with $c = 0$, resulting in $\int_{-1}^{0}\sqrt{x^2}\,dx + \int_{0}^{1}\sqrt{x^2}\,dx$, which can then be simplified to $\int_{-1}^{0} -x\,dx + \int_{0}^{1} x\,dx$.

#### **3.7 Solving Equations**

One particularly interesting technique for integration involves solving for the value of the integral in an equation<sup>2</sup>. If an integral $I$ can be written in the form $X - cI$, where $X$ is any expression (containing no or simpler integrals), and $c$ is a constant not equal to $-1$, then we can solve the equation $I = X - cI$ to obtain $I = X/(c+1)$. Common uses of this technique include integrating expressions of the form $e^{ax}\sin bx$ and $e^{ax}\cos bx$ (apply integration by parts twice, then solve the equation). The rule **Solve equation** is applied only to the whole expression, and takes two parameters: the index *id* of a previous step and a coefficient *coeff*. Let $I$ be the integral before step *id*. The rule adds *coeff* $\cdot\, I$ to the current expression, then divides by *coeff* $+\, 1$ and simplifies. For example, in the evaluation of $\int_0^{\pi/2} e^{2x}\cos x\,dx$, after some steps we get $-2 + e^{\pi} - 4\int_0^{\pi/2} e^{2x}\cos x\,dx$. Then, applying **Solve equation** with *id* = 1 and *coeff* = 4 yields the answer $\frac{1}{5}(-2 + e^{\pi})$.
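The underlying arithmetic is elementary and can be replayed in SymPy (symbol names are our own):

```python
import sympy as sp

I = sp.Symbol('I')               # value of the integral before step id
X = -2 + sp.exp(sp.pi)           # the integral-free part of the current expression
coeff = 4

# The current expression is X - coeff*I; solving I = X - coeff*I gives I = X/(coeff + 1).
solution = sp.solve(sp.Eq(I, X - coeff*I), I)[0]
print(solution)
```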

# **4 User Interface**

Above the level of representation of a computation, the graphical user interface helps the user to specify a computation in several ways. Compared to editing a computation directly, the user interface provides the following conveniences:

<sup>2</sup> This is valid as long as the integral exists. In our setting this holds as long as the integrand is continuous.


In the remainder of this section, we describe the last two functionalities in more detail. A screenshot of the user interface is shown in Figure 1.


**Fig. 1.** Screenshot of the user interface, showing the computation of Example 2 in Section 6.

#### **4.1 Substitution**

As discussed in Section 3.3, the **Substitution I** rule requires both $f$ and $g$ as parameters, while typically only $g$ is specified in informal arguments. Finding the function $f$ can be a nontrivial process. We try two heuristic methods for finding $f$. First, if the substitution $u = g(x)$ can be solved for $x$, yielding a function $h$ such that $x = h(u)$, then $f$ can be found by dividing the integrand by $g'(x)$, then substituting $h(u)$ for $x$ and simplifying. Both solving and simplification can be done without checking well-definedness of intermediate expressions, since in the end one only needs $f(g(x))g'(x)$ to equal the integrand. For the implementation, we use SymPy's solve function to attempt to find $h$. The second heuristic simply replaces all expressions equal to $g(x)$ by $u$, then hopes that all remaining occurrences of $x$ are in a single $g'(x)$ in the numerator. Note that the user can always first rewrite the expression into a form where the second heuristic can be applied.
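The first heuristic can be sketched as follows (`find_f` is our own illustration; SymPy's solve may return several candidate solutions for $h$, and we naively take the first):

```python
import sympy as sp

def find_f(integrand, g, x, u):
    """Heuristic: solve u = g(x) for x to obtain x = h(u), then compute
    f(u) = (integrand / g'(x)) with h(u) substituted for x."""
    sols = sp.solve(sp.Eq(u, g), x)
    if not sols:
        return None
    h = sols[0]
    return sp.simplify((integrand / sp.diff(g, x)).subs(x, h))

x = sp.Symbol('x')
u = sp.Symbol('u', positive=True)
# For u = x**2 on the integrand 2*x*exp(x**2), the heuristic recovers f(u) = exp(u).
print(find_f(2*x*sp.exp(x**2), x**2, x, u))
```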

#### **4.2 Rational Functions**

Polynomial division or partial fraction decomposition is a common first step for integrating rational functions. From the user interface, the user can invoke these actions. Then SymPy's apart method is used to obtain the result. For example, starting from the integral $\int_{1/3}^{1/2} \frac{x}{1-x^4}\,dx$, the user may choose partial fraction decomposition from the menu, which turns the integral into $\int_{1/3}^{1/2} \frac{x}{2(x^2+1)} - \frac{1}{4(x+1)} - \frac{1}{4(x-1)}\,dx$. The **Rewrite** rule with the appropriate *denom* parameter is generated from this step.
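SymPy's apart computes the decomposition, and the result can be checked by recombining it with the original integrand:

```python
import sympy as sp

x = sp.Symbol('x')
integrand = x / (1 - x**4)
parts = sp.apart(integrand, x)   # partial fraction decomposition

# The decomposition must recombine to the original integrand.
assert sp.cancel(parts - integrand) == 0
print(parts)
```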

#### **4.3 Trigonometric Identities**

For the application of trigonometric identities, the user does not need to remember the names of any rules in Fu's method. Instead, the user selects a subexpression to rewrite. Then, each of Fu's rules is applied in turn using SymPy. If the application of a rule modifies the expression, the new expression is displayed, and the user can select from the displayed options. The selected action is then recorded with the corresponding name.

#### **4.4 Slagle's Method**

We implement a heuristic integration method due to Slagle [45]. There are two main reasons why we choose Slagle's method. First, it is simple but effective for college-level problems. Second, it can output human-readable reasoning steps. This method maintains a search tree consisting of AND-nodes and OR-nodes. Each node contains an integral, with the root containing the original integral. An AND-node specifies that the integral at the node is solved if each of its child nodes is solved. An OR-node specifies that the integral at the node is solved if one of its child nodes is solved. The method iteratively expands the tree using a list of *algorithmic* and *heuristic* rules. Algorithmic rules involve basic normalization operations such as simplification and polynomial division; they are always applied to each node. In contrast, heuristic rules are more exploratory, such as guessing potential expressions for substitution, and each application counts as one step in the search.

Our implementation is mostly faithful to the original presentation [45], with some modifications to fit better with our framework. The output of Slagle's method (if successful) is a list of applications of algorithmic and heuristic rules. Each rule can then be converted to one or more computation steps described in Section 3.

# **5 Proof Translation**

We now describe the process for translating a computation to a proof in higher-order logic. This requires sufficiently strong proof automation for verifying the application of each integration rule. The main components of the automation include showing two expressions are equal by simplification, inequality checking, and showing continuity, differentiability, and integrability of functions. The proof automation is implemented in Python based on HolPy. However, it should be possible to implement it in other proof assistants, and one aim of this section is to provide details to facilitate this process.

#### **5.1 Introduction to HolPy**

HolPy [49] is a new system for interactive theorem proving implemented in Python. Like Isabelle [32], HOL Light [21], and HOL4 [1], it uses higher-order logic as the logical foundation. The design of HolPy centers around explicit proof terms that can be generated and checked as Python objects, and written to a file in JSON format. Macros are used pervasively to control the size of proof terms. An API for proof automation facilitates implementation of procedures generating proof terms, in a manner similar to writing proof automation in the ML family of languages, but in the setting of an imperative programming language.

#### **5.2 Background Library**

For the background library in analysis, we ported statements of over a thousand theorems from HOL Light, of which about 40% are proved using the point-and-click based user interface [49]. However, major parts of the theory are yet to be formalized, including the construction of real numbers, the gauge integral, and the fundamental theorem of calculus. At present, the statements of the theorems need to be trusted. Finishing the formalization of the analysis library is planned as future work.

#### **5.3 Structure of Proof Automation**

The procedure for translating a computation is as follows. For each step in the computation, all expressions involved are first translated into terms in higher-order logic. Depending on the rule used, the automation applies the appropriate conversion to the input term, with the parameters of the rule serving as additional arguments to the conversion. Next, the automation attempts to show the equality between the result of the conversion and the expected output of the step by simplifying both sides. Hence, there does not need to be perfect agreement between the expected output and what is computed by proof automation. The translation is successful as long as proof automation is able to show their equivalence. In this way, we allow additional flexibility in the implementations.

We now discuss the overall structure of proof automation, which bears some similarity to the structure of auto and simp tactics in Isabelle [48]. We maintain two tables: a table of proof rules and a table of simplification rules. Each table is indexed by the head of the predicate or term the rule expects. There may be multiple rules associated to the same head term.

**–** A *prove* rule for a predicate p takes as input a goal whose head is p and a list of assumptions, and attempts to prove the goal. A simple way to specify a prove rule is from a list of theorems whose conclusion matches the given predicate. The corresponding prove rule attempts to apply each of the theorems in order. In case a theorem has assumptions, it recursively applies the overall prove procedure (described below) to discharge each assumption.

**–** A *simplification* rule for a function f takes as input a term whose head is f and a list of assumptions, and computes the simplification of the term under these assumptions. A simple way to specify a simplification rule is from a list of theorems whose conclusion is an equality, where the left side has head f. The corresponding simplification rule attempts to rewrite using each of the equalities in order. Assumptions in the theorem are discharged by recursive calls to prove as in the previous case.

The overall procedure is defined as a mutual recursion between two functions prove and norm. The norm function receives a term and a list of assumptions as input. It first recursively applies itself to the subterms of the term. Next, it looks for simplification rules associated to the head of the term and applies them in turn. If the head changes, the process is repeated. This continues until the term is not changed by the simplification rules. Note that the prove function may be called to discharge assumptions of rewrite rules. The prove function takes a goal and a list of assumptions as input. It first simplifies the goal, then looks for prove rules associated to the head term and applies each of them in turn. The case where the goal is an equality reduces to simplifying both sides and then comparing whether they are the same.
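A schematic version of the prove/norm recursion (greatly simplified; HolPy's actual automation operates on higher-order logic terms and produces proof objects, while this toy works on plain tuples):

```python
# Schematic prove/norm loop. Terms are nested tuples ('head', arg1, ...).
# SIMP_RULES maps a head to rules returning a new term or None;
# PROVE_RULES maps a head to rules returning True or False.
SIMP_RULES = {}
PROVE_RULES = {}

def head(term):
    return term[0]

def norm(term, assms):
    """Normalize subterms first, then apply simplification rules at the head
    until a fixed point is reached."""
    if not isinstance(term, tuple):
        return term
    term = (head(term),) + tuple(norm(t, assms) for t in term[1:])
    changed = True
    while changed:
        changed = False
        for rule in SIMP_RULES.get(head(term), []):
            new = rule(term, assms)
            if new is not None and new != term:
                term, changed = new, True
                break
    return term

def prove(goal, assms):
    """Simplify the goal; an equality reduces to comparing normal forms,
    otherwise try the prove rules registered for the goal's head."""
    goal = norm(goal, assms)
    if head(goal) == '=' and goal[1] == goal[2]:
        return True
    return any(rule(goal, assms) for rule in PROVE_RULES.get(head(goal), []))

# Demo simplification rule: 0 + t rewrites to t.
SIMP_RULES['+'] = [lambda term, assms: term[2] if term[1] == ('num', 0) else None]
print(prove(('=', ('+', ('num', 0), ('x',)), ('x',)), []))  # True
```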

#### **5.4 Inequality Checking**

A major task of proof automation is checking inequalities in one variable $x$ constrained to lie in an interval $[a, b]$ or $(a, b)$. For example, if one wishes to simplify $\sqrt{f(x)^2}$ to $f(x)$ in the integrand, where the integral is from $a$ to $b$, one needs to check $f(x) \ge 0$ in the open interval $(a, b)$. Here $f$ may involve the usual arithmetic operations, as well as logarithm, exponential, and trigonometric functions.

The general problem of inequality checking is undecidable when special functions are involved. Hence, we can only hope for methods that can solve most of the inequality goals that appear in practice. There are many heuristic methods [7] as well as decision procedures for inequalities. For our purposes, we found the following, which can be considered as a simplified version of interval arithmetic, to be both simple and effective: starting from the assumption that x lies in a certain interval, iteratively deduce the intervals constraining each of the subterms in the expression. The derivation for each subterm depends on the head of the subterm. Of course, this method is incomplete as it tends to over-approximate the intervals of terms formed from binary operators. Implementation of more advanced inequality checking methods is a goal for the future.
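The interval deduction can be sketched as follows (a toy version of our own; the real automation must additionally justify each deduction step by a theorem):

```python
import math

def interval(e, x_range):
    """Bottom-up over-approximation of the range of e(x) for x in x_range.
    Terms are nested tuples; returns a pair (lo, hi)."""
    op = e[0]
    if op == 'x':
        return x_range
    if op == 'const':
        return (e[1], e[1])
    if op == '+':
        (a, b), (c, d) = interval(e[1], x_range), interval(e[2], x_range)
        return (a + c, b + d)
    if op == '*':
        (a, b), (c, d) = interval(e[1], x_range), interval(e[2], x_range)
        prods = [a*c, a*d, b*c, b*d]
        return (min(prods), max(prods))
    if op == 'exp':                        # exp is monotonically increasing
        a, b = interval(e[1], x_range)
        return (math.exp(a), math.exp(b))
    raise NotImplementedError(op)

# For x in (0, 1): x lies in (0, 1), exp(x) in (1, e), x*exp(x) in (0, e),
# so x*exp(x) + 1 lies in (1, 1 + e) and is in particular positive.
lo, hi = interval(('+', ('*', ('x',), ('exp', ('x',))), ('const', 1)), (0.0, 1.0))
print(lo > 0)  # True: the inequality x*exp(x) + 1 > 0 is certified on (0, 1)
```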

#### **5.5 Simplification**

Simplification for arithmetic operations follows the same principle as in Section 3.1: expand the expression into polynomial form, but do not expand powers. We also do not reduce rational functions. This is similar to the normalization of polynomials in other implementations of proof automation [7].

More precisely, define a monomial to be a term of the form $c \cdot (a_1^{p_1} a_2^{p_2} \cdots a_k^{p_k})$, where $c$ is a rational number, and each $a_i$ is either a prime number or a term whose head is not an arithmetic operator. If $a_i$ is a prime number, then the corresponding $p_i$ must be either non-constant or a rational number strictly between 0 and 1. The $a_i$'s are distinct and sorted in a pre-determined order. A rational number is a special case of a monomial, with $k = 0$. We call $c$ the coefficient of a monomial and $a_1^{p_1} a_2^{p_2} \cdots a_k^{p_k}$ its body. A polynomial is a sum of monomials, whose bodies are all distinct and in sorted order. It is clear that any expression can be simplified into this form. For example, $\sqrt{6}\,\sqrt{2}\,(x + 3^{2/3})$ is simplified to

$$6^{1/2}2^{1/2}x + 6^{1/2}2^{1/2}3^{2/3} = 2^{1/2}3^{1/2}2^{1/2}x + 2^{1/2}3^{1/2}2^{1/2}3^{2/3} = 2 \cdot 3^{1/2}x + 6 \cdot 3^{1/6}$$
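The example can be cross-checked with SymPy used as an independent oracle (this is only a sanity check, not the system's own simplifier):

```python
import sympy as sp

x = sp.Symbol('x')
lhs = sp.sqrt(6) * sp.sqrt(2) * (x + 3**sp.Rational(2, 3))
rhs = 2*sp.sqrt(3)*x + 6 * 3**sp.Rational(1, 6)

# The initial expression and its claimed normal form agree as real expressions.
assert sp.simplify(lhs - rhs) == 0
print("normal form verified")
```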

Simplification of polynomials is implemented in the simplification rules for $+$, $\times$ and power. $a - b$ and $a/b$ are simply reduced to $a + (-1) \cdot b$ and $a \cdot b^{-1}$, respectively.

For logarithms and exponentials, we apply the standard simplification rules $\log 1 = 0$, $\log(e^x) = x$, $e^0 = 1$, and $x > 0 \longrightarrow e^{\log x} = x$. Simplifying trigonometric functions applied to special values is trickier, as we may need to add or subtract multiples of $\pi$. For example, $\cos\frac{7\pi}{3}$ is first rewritten to $\cos\frac{\pi}{3}$ and then to $\frac{1}{2}$.

When simplifying an integral over the closed interval [a, b], we apply the following congruence rule:

$$\forall x \in (a, b). \ f(x) = g(x) \longrightarrow \int\_{a}^{b} f(x) \, dx = \int\_{a}^{b} g(x) \, dx.$$
 
This allows us to assume x ∈ (a, b) when simplifying f(x).

#### **5.6 Applying Theorems**

For proving continuity and differentiability, we set up the corresponding prove rules using lists of introduction rules. Some of these rules require assumptions that are discharged recursively. For example, the introduction rule for division is as follows:

> continuous_on S f,  continuous_on S g,  ∀x ∈ S. g(x) ≠ 0 −→ continuous_on S (λx. f(x)/g(x))

Application of this rule involves recursively proving the three assumptions, including the use of inequality checking from Section 5.4.

Substitution and integration by parts are implemented by applying the corresponding theorems. This is simple because the parameters of the rule already contain instantiations for all function variables.

# **6 Evaluation and Examples**

We evaluated our prototype implementation<sup>3</sup> on problems taken from exam preparation books (Tongji), online problem lists by D. Kouba [27] (Kouba) and

<sup>3</sup> The code and examples are available online at https://github.com/bzhan/holpy.


the MIT Integration Bee [2] (MIT). We also compared our results with Maple and WolframAlpha. Statistics from the evaluation are shown in Table 1.

**Table 1.** Statistics on the problem lists. "Solved" indicates the number of problems for which proofs can be successfully reconstructed from human-provided computations. "Slagle" indicates the number of problems that can be solved by Slagle's method, with successful proof reconstruction. "Maple" represents the number of problems solved by Maple. "WolframAlpha" represents the number of problems which WolframAlpha can give step-by-step solutions without exceeding its time limit.

The Kouba problem lists are divided into categories based on the techniques used. With human-provided computation steps, we can reconstruct proofs for all of the Tongji problems and most of the problems in D. Kouba's lists, while problems from the MIT Integration Bee are more challenging (with the later years increasing in difficulty). Most of the failures are due to being unable to show equality after simplification or during inequality checking; some are due to unsupported functions.

We show two interesting examples from our case studies. SymPy (version 1.5) returns a wrong answer on the first example and times out on the second. The second example takes a long time even for Mathematica, and cannot be solved by its online version WolframAlpha. These examples demonstrate that our system avoids such common errors and, since the user can guide the computation step-by-step, is also able to verify integrals that are difficult even for sophisticated computer algebra systems.

The first example (Tongji, #27) demonstrates the splitting of the domain of integration, as well as the use of trigonometric identities. The integral is

$$\int\_0^\pi \sqrt{1 + \cos 2x} \, dx$$

This integral is incorrectly evaluated by SymPy as 0. It is correctly evaluated by Mathematica almost instantly.

The evaluation begins with the application of trigonometric identities, rewriting the integrand to √(1 + cos² x − sin² x) and then to √(2 cos² x). For this, the user simply needs to select cos 2x and then sin² x, and choose the desired rewrite targets. The resulting situation is similar to the example given in the introduction. It is then necessary to split the domain of integration where cos x = 0. The system is able to automatically determine x = π/2. The full computation is:

$$\begin{split} I &= \int\_{0}^{\pi} \sqrt{1 + \cos^{2} x - \sin^{2} x} \, dx \quad \text{(Rewrite trig. rule TR11)}\\ &= \int\_{0}^{\pi} \sqrt{2 \cos^{2} x} \, dx = \sqrt{2} \int\_{0}^{\pi} |\cos x| \, dx \quad \text{(Rewrite trig. rule TR5, Simplification)} \\ &= \sqrt{2} \left( \int\_{0}^{\frac{\pi}{2}} |\cos x| \, dx + \int\_{\frac{\pi}{2}}^{\pi} |\cos x| \, dx \right) \quad \text{(Split region with } c = \frac{\pi}{2} )\\ &= 2\sqrt{2} \quad \text{(Elim absolute value, Simplification)} \end{split}$$

The second example comes from MIT Integration Bee 2019, problem #14:

$$I = \int\_0^{\pi/100} \frac{\sin(20x) + \sin(19x)}{\cos(20x) + \cos(19x)} \, dx$$

The problem is simple if one applies the sum-to-product identity first, but almost impossible otherwise. WolframAlpha fails to find the symbolic answer. Run offline, Mathematica takes about 15 seconds to return an answer, which is however much more complicated than necessary.

The full computation using our tool is:

$$\begin{split} I &= \int\_0^{\pi/100} \frac{\sin\left(\frac{39}{2}x\right)}{\cos\left(\frac{39}{2}x\right)} dx \quad \text{(Rewrite trigonometric, rule TR9)}\\ &= \int\_{\cos\left(\frac{39\pi}{200}\right)}^1 \frac{2}{39} \frac{1}{t} dt \quad \text{(Substitution I with } g = \cos\left(\frac{39}{2}x\right))\\ &= -\frac{2}{39} \log\left(\cos\frac{39\pi}{200}\right) \quad \text{(Simplification)}. \end{split}$$
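As an independent numerical sanity check (ours, outside the verified proof), the simplified integrand tan(39x/2) can be integrated by a midpoint Riemann sum and compared against the symbolic answer:

```python
from math import tan, cos, log, pi

# Numeric sanity check (not part of the proof): the integrand above
# simplifies to tan(39x/2), so a midpoint Riemann sum over [0, pi/100]
# should match the symbolic result -(2/39) * log(cos(39*pi/200)).

def midpoint_integral(f, a, b, n=100_000):
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

approx = midpoint_integral(lambda x: tan(39 * x / 2), 0, pi / 100)
exact = -(2 / 39) * log(cos(39 * pi / 200))
print(abs(approx - exact) < 1e-9)   # True
```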

#### **7 Conclusion**

In this paper, we proposed a framework for verifying symbolic computation of definite integrals, where the user can perform computations in an interface familiar from computer algebra systems, but with results verified by automatic translation to proofs in higher-order logic. The design of the framework follows a layered approach, with each layer focusing on a different aspect of the problem: methods for solving integrals, computer algebra, and proof reconstruction. We implemented a prototype system based on HolPy, and evaluated it on a test suite consisting of publicly available problem lists at the undergraduate level, showing its effectiveness on a large majority of cases.

One immediate piece of future work is to secure the foundation of the higher-order logic proofs by formalizing the proofs of the required theorems. Another gap is the arithmetic computation and comparison of real constants, which, in the case of comparisons, would require approximation techniques [10].

Our prototype implementation focuses on definite integrals of one-variable functions. However, the idea can be applied more generally, by suitably extending the language of integration rules. For applications in the engineering domain, some extensions that would be of high value include linear algebra, improper integrals (including Laplace and Fourier transforms), and vector calculus.

# **References**


Symposium on Symbolic and Algebraic Computation, ISSAC 2019, Beijing, China, July 15-18, 2019. pp. 147–154. ACM (2019)



# **ATP and AI**

# **Confidences for Commonsense Reasoning**

Tanel Tammet<sup>1(B)</sup>, Dirk Draheim<sup>2</sup>, and Priit Järv<sup>1</sup>

<sup>1</sup> Applied Artificial Intelligence Group, Tallinn University of Technology, Tallinn, Estonia {tanel.tammet,priit.jarv1}@taltech.ee <sup>2</sup> Information Systems Group, Tallinn University of Technology, Tallinn, Estonia dirk.draheim@taltech.ee

**Abstract.** Commonsense reasoning has long been considered one of the holy grails of artificial intelligence. Our goal is to develop a logic-based component for hybrid – machine learning plus logic – commonsense question answering systems. A critical feature for the component is estimating the confidence in the statements derived from knowledge bases containing uncertain contrary and supporting evidence obtained from different sources. Instead of computing exact probabilities or designing a new calculus we focus on extending the methods and algorithms used by the existing automated reasoners for full classical first-order logic. The paper presents the CONFER framework and implementation for confidence estimation of derived answers.

## **1 Introduction**

The mainstream approaches to "commonsense reasoning" (CSR) before this century focused on rule-based reasoning and building suitable logical systems. During the last ten years the focus has switched to machine learning and neural networks. Both of these approaches appear to be limited. A promising approach to practical question answering is building hybrid systems like Watson [17], which complement current machine learning systems for natural language with logic-based reasoning systems specialized for CSR. In particular, hybrid systems have good potential for progress towards explainable AI; see Marcus [26] for an overview of current work in the area. Our goal is to build upon the existing theory and reasoning systems for first-order logic (FOL) to develop a framework and practical systems using FOL reasoners which could be incorporated into a hybrid system containing both machine learning and rule-based reasoning components. This approach also provides step-by-step proofs for the answers found, useful for building explainable systems.

We will present the design and implementation of the CONFER framework for extending existing automated reasoning systems with confidence calculation capabilities. We will not focus on other, arguably even more critical issues for CSR and question answering, like handling natural language itself, dialogues,

c The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 507–524, 2021. https://doi.org/10.1007/978-3-030-79876-5_29

rules with exceptions and default logic [31] or circumscription, knowledge representation for space/time, epistemic reasoning, using context, building and collecting suitable rules, machine learning etc.

The specific CSR task targeted by the current paper is *question answering*: given either a knowledge base of facts and rules or a large corpus of texts (or both), plus optionally a situation description (assumptions) for the questions, answer questions posed either in logic or in natural language.

Historically, the longest-running CSR project has been the logic-based CYC project [25], which already stated its focus on CSR in 1985. Despite several successes, the approach taken in the CYC project has often been viewed as problematic ([8], [10]) and has repeatedly been used as an argument against logic-based methods in CSR. Beltagy et al. [5] experiment with Markov Logic Networks for combining logical and distributional representations of natural language meaning. Domingos et al. note in [13] that the CYC project has used Markov Logic to make a part of their knowledge base probabilistic. Khot et al. [24] experiment with Markov Logic Networks for NLP question answering. Furbach et al. [20] describe research and experiments with a system for natural language question answering, converting natural language sentences to logic and then performing proof search, using different existing FOL knowledge bases. The authors note a number of difficulties, the most crucial being the lack of sufficiently rich FOL knowledge bases. The closest current approach to ours appears to be the Braid system [23], built by the team previously involved with the Watson system.

#### **2 Interpretation and Encoding of Uncertainty**

Reasoning under uncertainty has been thoroughly investigated for at least a century, leading to a proliferation of different theories and mechanisms. A classic example is the MYCIN system [6]. For newer approaches see, for example, [32] and [9]. Each of these is well suited for certain kinds of problems and ill-suited for other kinds. Underlying this is the philosophical complexity of interpreting probability: see [22] for an overview, see also [16], pp. 5-7.

Most of the previous work on combining logic with uncertainty has targeted propositional logic. First-order logic is then handled by creating a finite set of weighted ground instances of formulas. This is the approach taken, for example, by the probabilistic logic programming systems ProbLog2 [18] and PRISM [34], and by the implementation of Markov Logic Networks [12,11] in the Alchemy 2 system [1]. These systems impose various restrictions on the FOL formulas, and while they are well-suited for small domains where the restrictions can be met, the approach becomes unfeasible if the domain is large or the formulas complex. For example, neither the ProbLog2 nor the Alchemy 2 implementation manages to answer queries like 1.0::p(a). 1.0::p(i(a,b)). 1.0::p(Y) :- p(X), p(i(X,Y)). query(p(b)). The implementation of ProbLog2 [29] fails, presumably due to infinite recursion in searching for possible groundings for the variables, while Alchemy 2 does not allow function terms in grounded facts.

Previous approaches to full first-order logic tend to fall into one of three camps: using fuzzy logic [41], representing probabilities as intervals (see [15] for the axiomatic derivation of Dempster-Shafer rules), or interpreting probabilities via many worlds, similarly to modalities [4].

For the sake of this work, we largely follow the *subjective interpretation* of probability as a degree of belief, originating from Ramsey and De Finetti. We use the word *confidence* to denote our rough adherence to this interpretation. We avoid using complex measures such as intervals, distributions or fuzzy functions.

In the context of question answering we assume that confidences are typically used for sorting a list of candidate answers by their calculated confidence and optionally applying a filter to eliminate answers with a confidence under a certain threshold. The answers provided may also be annotated with a confidence number. If we are given or can calculate several different confidences for the same answer, we always prefer the higher confidence. The question of calculating a correct probability rarely arises, or is considered infeasible.

#### **2.1 Sources, Representation and Meaning of Statements, Confidences and Dependencies**

We assume that the confidence in a fact or rule in our common sense knowledge base (KB in the following) typically arises from a large number of human users via crowd-sourcing, as in ConceptNet [35,7], from NLP analysis of text scraped from the web, as in NELL [27], from combining different knowledge bases with weights, as in [14] and [7], or from weights assigned to the equivalence of name pairs in the vocabulary, as in [28] and [19]. There is recent progress towards making knowledge bases for common sense reasoning where the relation strengths (typicality, saliency) have been empirically evaluated [7,33].

To each FOL statement S we will assign both a confidence c and a set L of unique identifiers of the (non-derived) input statements used for deriving this statement, forming a triple ⟨S, c, L⟩. Lists of such triples are then treated as sets. The dependency lists L are used in the formula estimating the cumulated confidence. The algorithm for calculating the confidences c of derivations will be presented later.

To be more exact, we will not allow assigning confidences to arbitrary statements. Instead, we assume that the FOL statements are converted to conjunctive normal form: a conjunction of Skolemized disjunctions, where each disjunction consists only of atomic statements (a predicate applied to arguments) or negations of atomic statements. Such disjunctions are called *clauses*. We will not allow nested triples, i.e., S is always a pure FOL clause not containing any confidence or dependency information usable by the presented algorithms. However, for each single FOL clause S there may be many different derivable triples ⟨S, c, L⟩ for different c and L, stemming from different derivation trees of S. These are assumed to be independent statements, possibly allowing the calculation of a cumulative confidence for S higher than max(c, c′), where c and c′ come from different triples.

A KB may contain logical contradictions and identical FOL clauses with different confidences given by different sources. For example, the following is a logically contradictory KB containing several copies of the same clause with different confidences. The CONFER algorithm presented later derives a confidence of 0.682 for bird(a) from this KB:

⟨bird(X), 0.1, L1⟩, ⟨bird(a), 0.8, L2⟩, ⟨bird(a), 0.9, L3⟩, ⟨¬bird(a), 0.3, L4⟩
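The value 0.682 can be reproduced under the simplifying assumption of fully independent derivations (i · h = 1 in the cumulation of Section 3.5), cumulating the positive evidence and then subtracting the negative evidence (Section 3.6). This is our reconstruction for this one example, not the exact CONFER algorithm:

```python
# Reproducing the bird(a) confidence 0.682, assuming fully independent
# derivations so that cumulation reduces to c1 + c2 - c1*c2; the
# confidence of the negative evidence is subtracted at the end.

def cumulate(confidences):
    acc = 0.0
    for c in confidences:
        acc = acc + c - acc * c       # P(A or B) for independent A, B
    return acc

positive = cumulate([0.1, 0.8, 0.9])  # bird(X) instance + two bird(a) facts
negative = 0.3                        # the single ¬bird(a) triple
print(round(positive - negative, 3))  # 0.682
```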

We interpret the confidence as estimating a lower limit of the probability of a statement, i.e., ⟨S, c, L⟩ is interpreted as "the statements L support the claim that probability(S) ≥ c". Thus two different confidence statements for the same clause are never contradictory, even if given by the same source.

# **3 The CONFER Extension Framework for CSR**

In the following we will present the CONFER framework of extensions to the mainstream resolution-based search methods. We expect that the same framework can be adapted to search methods different from resolution, i.e. the specific aspects of resolution are not relevant for the main principles of the approach.

The intuition behind CONFER is preserving first order classical logic (FOL) intact as an underlying machinery for derivations in CSR. The core methods of automated reasoning used by most of the high-performance automated reasoning systems remain usable as core methods for CSR. Essentially, FOL with the resolution method produces all combinations of derivable sentences (modulo simplifications like subsumption) which could lead to a proof. The main difference between strict FOL and CONFER extensions is in the handling of constructed proof trees: the outcome of a CONFER reasoner is a set of combined FOL proofs with the confidence measures added.

Importantly, the framework does not generally calculate the exact maximal confidence for derived statements, since this is, in nontrivial cases, either impossible or unfeasible. Our goal is to give a practically useful estimation of the maximal confidence without causing a large overhead on the FOL proof search and avoiding combinatorial explosion while calculating the confidences.

#### **3.1 Resolution Method**

In the following we will assume that the underlying first order reasoner uses the resolution method, see [3] for details. The rest of the paper assumes familiarity with the basic concepts, terminology and algorithms of the resolution method.

#### **3.2 Queries and Answers**

We assume the question posed is in one of two forms: *(1)* is the statement Q true? *(2)* find values V for the existentially bound variables in Q so that Q is true. For simplicity's sake we assume that the statement Q is in prenex form, i.e., no quantifiers occur in the scope of other logical connectives.

In the second case, it could be that several different value vectors can be assigned to the variables, essentially giving different answers. We also note that an answer could be a disjunction, giving possible options instead of a single definite answer. However, as shown in [38], in case a single definite answer exists, it will be derived eventually.

A widely used machinery in resolution-based theorem provers for extracting values of existentially bound variables in Q is to use a special *answer predicate*, converting a question statement Q to a formula

$$\exists X\_1, \dots, \exists X\_n (Q(X\_1, \dots, X\_n) \& \neg answer(X\_1, \dots, X\_n))$$

for the existentially quantified variables in Q [21]. Whenever a clause is derived which consists only of answer predicates, it is treated as a contradiction (essentially, an answer), and the arguments of the answer predicate are returned as the values sought. A common convention is to call such clauses *answer clauses*. We require that the proof search does not stop whenever an answer clause is found, but continues to look for new answer clauses until a predetermined time limit is reached. See [37] for a framework for extracting multiple answers.

We also assume that queries take a general form (*KB*&A) <sup>⇒</sup> <sup>Q</sup> where *KB* is a commonsense knowledge base, A is an optional set of precondition statements for this particular question and Q is a question statement.

Since we assume the use of the resolution method for proof search, the whole general query form is negated and converted to clauses, i.e., disjunctions of literals (positive or negative atoms). We will call the clauses stemming from the question statement *question clauses*.

#### **3.3 Top Level of the Algorithm**

Calculating confidences for question answering requires, at least, the ability to calculate (a) the decreasing confidence of a conjunction of clauses as performed by the resolution and paramodulation rule, (b) the increasing confidence of a disjunction of clauses for cumulating evidence, (c) the decreasing confidence of considering negative evidence for a clause.

While the systems based on, say, Bayes networks and Markov logic, perform these operations in a combined manner, our framework will split the whole search into separate phases for each. First we perform a modified resolution search we call *c-resolution* calculating the decreasing confidence and potentially giving a large number of different answers and proofs. Next we will combine the different proofs using the cumulation operation. Finally we will collect negative evidence for all the answers obtained so far, separately for each individual answer. The latter search is also split into the c-resolution phase and the cumulating phase. Since we assume the use of full FOL, the c-resolution search will not necessarily terminate, thus we will use a time limit. The top level of the algorithm is presented in the following section as Algorithm 1.

#### **Algorithm 1** CONFER algorithm

**Input**: Common sense knowledge base KB, question Q, time limit t.

**Output**: Set of answers R with attached confidences.
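The phase structure described above (c-resolution, cumulation, then per-answer negative evidence) can be sketched as follows; all function names are ours and the stubs are placeholders, not the CONFER implementation:

```python
# Hypothetical sketch of the CONFER phase structure from Section 3.3.
# The stubs stand in for c-resolution (3.4) and cumulation (3.5).

def c_resolution(kb, question, time_limit):
    """Stub: return (answer, confidence, dependency-set) proof triples
    derivable for the question within the time limit."""
    return [t for t in kb if t[0] == question]

def cumulate(proofs):
    """Stub: combine different proofs of the same answer (here: max)."""
    out = {}
    for ans, conf, _deps in proofs:
        out[ans] = max(out.get(ans, 0.0), conf)
    return out

def negate(q):
    return q[1] if isinstance(q, tuple) and q[0] == 'not' else ('not', q)

def confer(kb, question, time_limit=10):
    # Phases 1-2: c-resolution search, then cumulate positive evidence.
    pos = cumulate(c_resolution(kb, question, time_limit))
    # Phase 3: separate search for negative evidence, subtracted per answer.
    neg = cumulate(c_resolution(kb, negate(question), time_limit))
    neg_conf = neg.get(negate(question), 0.0)
    return {ans: conf - neg_conf for ans, conf in pos.items()}

kb = [('bird', 0.9, {1}), ('bird', 0.8, {2}), (('not', 'bird'), 0.3, {3})]
print(round(confer(kb, 'bird')['bird'], 2))   # 0.6
```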


#### **3.4 C-Resolution**

The core part of the algorithm described above is c-resolution: a relatively simple modification of the resolution method that calculates and keeps track of the (multiplied) confidences of the premises of each step, along with the union of their dependencies.

**Definition 1 (C-Resolution).** *A modification of the resolution method computing an ever-increasing set of different proofs for different answers (substitutions to the question clauses) while employing the relevance filter (Definition 2), performing basic confidence calculation for resolution steps (Definition 3), assigning the union of the dependency lists of the premises to each derived clause, restricting subsumption to c-subsumption (Definition 5) and restricting simplification steps according to c-subsumption.*

**Inconsistencies.** A KB with a nontrivial structure may contain inconsistencies in the sense that a contradiction can be derived from the KB. Looking at existing KBs mentioned earlier, we observe that they either are already inconsistent (for example, the largest FOL version of OpenCyc [30] in TPTP [40] is inconsistent) or would become inconsistent in case intuitively valid inequalities are added, for example, inequalities of classes such as "a cat is not a dog", "a male is not a female" or default rules such as "birds can fly", "dead birds cannot fly", "penguins cannot fly". We note that several large existing KBs do not contain such inequalities explicitly, although they are necessary for nontrivial question answering under the open-world assumption.

Since classical FOL allows deriving anything from a contradiction, it is clearly unsuitable for a large subset of KBs. Two possible ways of overcoming this issue are: (a) using some version of relevance logic or another paraconsistent logic, or (b) defining a filter for eliminating irrelevant classical proofs. We argue that, despite a lot of theoretical work in the area, little work has been done on automated proving for relevance logic, so using it directly is likely to create significant complexities. Instead, we introduce a simple relevance filter:

**Definition 2 (Relevance Filter).** *Each resolution derivation of a contradiction not containing any answer clauses is discarded.*

Since a standard resolution derivation of a contradiction does not lead to any further derivations, this filter is completeness-preserving in the sense that all resolution derivations containing an answer clause are still found.

**Confidences of Derived Clauses.** We take the approach of (a) providing a simple sensible baseline algorithm for calculating confidences of derived clauses, and (b) leaving open ways to modify this algorithm for specific cases as need arises. We will use a single rational number in the range 0...1 as a measure of a confidence of a clause, with 1 standing for perfect confidence and 0 standing for no information. Confidence of an atomic clause not holding is represented as a confidence of the negation of the clause.

As a baseline we use the standard approach of computing uncertainties of clauses derived from independent parent clauses A and B as:

$$P(A \land B) = P(A) \* P(B)$$

Notice that for dependent parent clauses this formula *under-estimates* the confidence of the result.

**Definition 3 (Basic Confidence Calculation for Resolution Steps).** *For binary resolution and paramodulation steps, the confidence of a result is obtained by multiplying the confidences of the premises. For the factorization step, the confidence of the result is the confidence of the premise, unchanged. Question clauses have a confidence 1.*

A simple example employing forward reasoning (concretely, *negative ordered resolution*):

```
0.8:: bird(tweety).
0.9:: bird(X) => canfly(X).
0.7:: canfly(X) => fast(X).
1.0:: fast(X) => answer(X).
```
leads to a sequential derivation of

```
0.72:: canfly(tweety).
0.504:: fast(tweety).
0.504:: answer(tweety).
```
Recall that the confidences are assumed to be lower bounds of probabilities. Notice that the possible dependence of the premises could be taken into account, as in the following section for cumulative evidence. This would result in higher confidence numbers for derivations with dependent premises. Consider the following example:

0.9:: bird(X) => canfly(X). 0.1:: -bird(X) => canfly(X).

Using the basic calculation step we can derive that anything can fly: 0.09:: canfly(X). However, since anything is either a bird or is not a bird, the confidence of canfly(X) should be at least 0.1, and possibly higher, depending on the ratio of birds to non-birds.

Generally, we can use the minimization operation leading to a higher confidence value than the multiplication of the confidences of premises in the following special case. The standard resolution inference rule used by a large class of automated reasoners is defined as

$$\frac{A_1 \lor A_2 \lor \dots \lor A_n \qquad \neg B_1 \lor B_2 \lor \dots \lor B_m}{(A_2 \lor \dots \lor A_n \lor B_2 \lor \dots \lor B_m)\sigma}$$

where σ is the most general unifier of A₁ and B₁. A clause A *subsumes* a clause B if the literals of Aδ are a subset of the literals of B for some substitution δ.

**Definition 4 (Extended Confidence Calculation for Resolution Steps).** *If* (A₂ ∨ … ∨ Aₙ)σ *subsumes* (B₂ ∨ … ∨ Bₘ)σ *in the resolution inference defined above, then the confidence of the result is the minimum of the confidences of the premises.*

**C-Subsumption and Simplifications.** Since standard subsumption used by resolution provers to clean up search space may remove clauses with a higher confidence or fewer dependencies than the subsuming clause, it may cause the prover to lose derivations potentially leading to a higher confidence. Thus we use *c-subsumption* instead of the standard subsumption:

**Definition 5 (C-Subsumption).** *A triple* T₁ = ⟨A₁, c₁, L₁⟩*, consisting of a clause* A₁*, a confidence* c₁ *and a dependency list* L₁*,* c-subsumes *a triple* T₂ = ⟨A₂, c₂, L₂⟩ *if and only if* A₁ *subsumes* A₂*,* c₁ ≥ c₂*, and* L₁ ⊆ L₂*.*
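Definition 5 translates directly into a predicate over triples. In the sketch below (ours), clause subsumption is stubbed as a subset test, which is exact only for ground clauses:

```python
# C-subsumption over (clause, confidence, dependency-set) triples.
# clause_subsumes is a ground-clause stub for full FOL subsumption.

def clause_subsumes(a1, a2):
    """Stub: for ground clauses, subsumption is the subset test."""
    return set(a1) <= set(a2)

def c_subsumes(t1, t2):
    (a1, c1, l1), (a2, c2, l2) = t1, t2
    return clause_subsumes(a1, a2) and c1 >= c2 and set(l1) <= set(l2)

t1 = (['bird(a)'], 0.9, {1})
t2 = (['bird(a)', 'canfly(a)'], 0.8, {1, 2})
print(c_subsumes(t1, t2))   # True: stronger clause, higher conf, fewer deps
print(c_subsumes(t2, t1))   # False
```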

We can prove the following lemma:

**Lemma 1 (C-Subsumption Preserves Completeness).** *When a c-resolution proof can be found without using subsumption, it can be also found with c-subsumption.*

The lemma holds for those resolution strategies for which standard subsumption preserves completeness in ordinary proof search without confidences.

We restrict the simplification operations like demodulation and subsuming resolution accordingly: a derivation step must keep the original premise P if the result has a lower confidence or a longer list of dependencies than P.

#### **3.5 Cumulative Confidence**

We will now look at the situation with additional evidence for the derived answer. In our context, using additional evidence is possible if a clause C can be derived in different ways, giving two different derivations d₁ and d₂ with confidences c₁ and c₂. In case the derivations d₁ and d₂ are independent, we could apply the standard formula

$$P(A \lor B) = P(A) + P(B) - P(A \land B)$$

to c₁ and c₂ to calculate the cumulative confidence for C.

What would it mean for derivations to be "independent"? In the context of commonsense reasoning we cannot expect to have an exact measure of independence. However, suppose the derivations d₁ and d₂ consist of exactly the same initial clauses, but used in a different order. In this case c₁ = c₂, and the cumulative confidence should intuitively also be just c₁: no additional evidence is provided. On the other hand, in case the non-question input clauses of d₁ and d₂ are mutually disjoint, the derivations are independent (assuming all the input clauses are mutually independent), and we should apply the previous rule for P(A ∨ B) to compute the cumulative confidence.

We will estimate the independence i of two derivations d₁ and d₂ simply as

$$1 - \frac{\text{number of shared input clauses of } d\_1 \text{ and } d\_2}{\text{total number of input clauses in } d\_1 \text{ and } d\_2} \tag{1}$$

Thus, if no clauses are shared between d₁ and d₂, then i = 1, and if all the clauses are shared, then i = 0.
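Formula (1) can be read as code; here (our reading) the "total number of input clauses" is taken as the size of the union of the two input-clause sets, which matches the boundary cases just stated:

```python
# Formula (1): derivations are represented by the sets of their input
# clause identifiers (the dependency lists L).

def independence(l1, l2):
    shared = len(set(l1) & set(l2))
    total = len(set(l1) | set(l2))   # input clauses occurring in d1 or d2
    return 1 - shared / total

print(independence({1, 2, 3}, {4, 5}))   # 1.0  (no shared clauses)
print(independence({1, 2}, {1, 2}))      # 0.0  (all clauses shared)
print(independence({1, 2}, {2, 3}))      # partial overlap
```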

In addition, we also know that it is highly unlikely that all the input clauses are mutually independent. Again, lacking a realistic way to calculate the dependencies, we give a heuristic estimate h in the range 0...1 to the overall independence of the input clause set, where 1 stands for total independence and 0 for total dependence.

Finally, we calculate the overall independence of two derivations d₁ and d₂ as i · h. We now postulate a heuristic rule for cumulating confidences using this combined independence measure.

**Definition 6 (Confidence Calculation for Cumulative Evidence).** *Given two derivations* d₁ *and* d₂ *of the search result* C *with confidences* c₁ *and* c₂*, calculate the updated confidence of* C *as*

$$\max(c_1 + c_2 \cdot i \cdot h,\; c_1 \cdot i \cdot h + c_2) - c_1 \cdot c_2 \cdot i \cdot h$$

*where*


The formula satisfies the following *intuitive requirements for cumulative evidence*:


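Definition 6 can be sketched in code (our reading of the formula), together with its boundary behaviour: i · h = 0 keeps the better single confidence, while i · h = 1 reduces to the independent-evidence formula c₁ + c₂ − c₁ · c₂:

```python
# Definition 6 as code, with checks of the boundary behaviour.

def cumulative(c1, c2, i, h):
    d = i * h                                       # overall independence
    return max(c1 + c2 * d, c1 * d + c2) - c1 * c2 * d

print(cumulative(0.8, 0.5, 0.0, 1.0))   # 0.8: fully dependent, keep max
print(cumulative(0.8, 0.5, 1.0, 1.0))   # 0.9: fully independent evidence
```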
#### **3.6 Negative Evidence**

Recall the standard mechanism employed in FOL provers for finding concrete answers: transforming existentially quantified goal clauses to clauses containing a special answer predicate and treating clauses containing only answer predicates as actual answers to the question found.

Once negation is present, the reasoning system using the CONFER framework has to attempt to find both positive and negative evidence for any potential answer. This cannot be easily done in a single proof search run.

Observe that posing a general question containing variables, like bird(X) ∨ answer(X), may produce a different set of answers than the positive question ¬bird(X) ∨ answer(X). Also observe that the potential set of answers may be huge for both positive and negative questions: in a large KB there may be millions of statements about birds, and our reasoning system will be able to derive only a small fraction of the potential answers in any given time slot. Thus, even if negative evidence is potentially derivable for some positive answer, the system is unlikely to find it.

A reasonable solution to this problem is to run the searches for negative evidence only for the concrete instances of positive answers found. More concretely, we conduct additional proof search for the negations of two types of questions Q: *(a)* if Q contains no existentially quantified variables, is the statement ¬Q true? *(b)* for each of the i vectors of values C1i, ..., Cni found for the existentially bound variables X1, ..., Xn in Q making Q true, is ¬Q true when we substitute the values C1i, ..., Cni for the corresponding variables in Q? The final confidence of an answer to Q is calculated by subtracting, from the confidence of the positive answer, the confidence of the answer to the corresponding negated instance of the question.

Using negative evidence may lead to unexpected results. Consider the following trivial example in the ProbLog syntax:

```
0.5::bird(a). 0.5::not bird(a). query(bird(a)).
```

CONFER gives us confidence 0, which we interpret as "no information", not as "false". However, ProbLog2 gives confidence 0.25, which is explained by one of the authors in private correspondence thus: an atom (head) is satisfied if any of the rules that make it true fire and none of the rules that make it false fire. In this example ProbLog2 gets 0.5 ∗ (1 − 0.5) = 0.25. On the other hand, the three different algorithms of the Alchemy 2 system – MC-SAT explained in [12], and exact and approximate probabilistic theorem proving explained in [11] – give the answers 0.015, 0 and 0.082, respectively. To be concrete, we are using the Alchemy 2 versions from [2]. For this and the following Alchemy 2 examples we prepared an MLN file with no weights and a training data file with some generated facts for each example. Then we ran the *learnwts* program with default parameters, which created the MLN file with weights for each example.
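The two readings of this example differ only in elementary arithmetic; the following sketch (variable names ours) reproduces both numbers:

```python
p_true, p_false = 0.5, 0.5   # confidences of bird(a) and not bird(a)

# ProbLog2-style reading: the atom holds if a rule making it true fires
# and no rule making it false fires.
problog2_conf = p_true * (1 - p_false)   # 0.5 * (1 - 0.5) = 0.25

# CONFER-style reading: subtract the confidence of the negative evidence
# from the confidence of the positive evidence ("no information" at 0).
confer_conf = p_true - p_false           # 0.5 - 0.5 = 0.0
```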

Next, consider a previous example augmented with the "birds fly" rule:

```
0.5::bird(a). 0.5::not bird(a). 0.9:: flies(X) :- bird(X). query(flies(_)).
```

Here CONFER gives us 0.45, which is inconsistent with the result of the previous example. ProbLog2, on the other hand, gives 0.225, which is unintuitive, but consistent with the unintuitive result of ProbLog2 in the previous example. The three algorithms of Alchemy 2 mentioned above give us 0.047, 0 and 0.98. The issue arising in this example is similar to nonmonotonic reasoning such as default logic: adding negative evidence for being a bird should block previously derivable facts. We know that since FOL is not decidable, such checks would make derivation steps generally not computable. As a final twist, we augment the ruleset by giving more details about the distribution:

```
0.5::bird(a). 0.5::not bird(a). 0.9:: flies(X) :- bird(X).
0.1:: not flies(X) :- bird(X).
%% 0.1:: flies(X) :- not bird(X). %% commented out
0.9:: not flies(X) :- not bird(X).
query(flies(_)).
```
Here CONFER gives us an acceptable 0.014 (positive evidence 0.490 and negative evidence 0.476), while ProbLog2 gives 0.2025. The results of Alchemy 2 are 0.047, 0 and 0.976. Adding the rule we have commented out makes CONFER give -0.008, while ProbLog2 complains that the example is not acceptable. Alchemy 2 gives us 0.056, 0 and 0.509.

#### **4 Implementation and Experimental Results**

The first author has implemented the CONFER framework as an extended version of his high-performance open-source automated reasoning system gkc [39] for FOL, which performs fairly well in the yearly CASC competition for automated reasoners [36]; see http://www.tptp.org/CASC/. Like gkc, the implementation is written in C. The compiled executable can be downloaded from http://logictools.org/confer/ along with a number of examples.

Several algorithms, strategies and optimizations present in the gkc system are currently switched off, due to the need for additional modifications and testing. In particular, parallel processing is switched off, as well as the crucial algorithms for selecting a list of suitable search strategies and performing search by batches with iteratively increasing time limits.

Importantly, we have not yet implemented any specialized strategies for using the attached confidences and dependencies for directing and optimizing search. It is clear that the added information gives ample potential opportunities for directing the search.

We will give an overview of the experiments with the implementation in two sections. First we will look at the confidences calculated and compare these, where possible, with the values given by ProbLog2 and Alchemy 2. Next we will look at the performance of the system on nontrivial problems.

The inputs and outputs for the CONFER implementation and the systems compared against are given on the web page http://logictools.org/confer/. The set of examples given contains over 30 case studies and can be run using the command-line implementation provided on the same web page as a single executable file. The implementation is self-contained, not dependent on other systems or external libraries. It should run on any 64-bit Linux system.

#### **4.1 Comparing Confidences**

We will compare the confidences calculated by CONFER on small selected examples with those of ProbLog2 and Alchemy 2. The first two examples are presented in the ProbLog2 tutorial. When CONFER can perform neither cumulation nor collection of evidence, the values calculated are the same as those of ProbLog2. The cumulation operation of CONFER produces, as expected, slightly different values than ProbLog2 or Alchemy 2. For the following examples the overall independence estimate h is assigned 1 (the maximum). Since the principles of handling negative evidence are fundamentally different between the systems, this operation causes the most significant differences. It is worth noticing that, more often than not, the results of ProbLog2 and Alchemy 2 also differ from each other.

First, we consider a simple version of the well-known social network of smokers example in the ProbLog syntax. CONFER uses a different syntax, but the clauses and confidences given are exactly the same. We have also built the corresponding data- and rulesets for Alchemy 2, which uses a fairly different input format than CONFER or ProbLog.

```
0.8::stress(ann). 0.4::stress(bob).
0.6::influences(ann,bob). 0.2::influences(bob,carl).
smokes(X) :- stress(X).
smokes(X) :- influences(Y,X), smokes(Y).
query(smokes(carl)).
```
For this example, ProbLog2 gives an answer 0.1376 and CONFER gives 0.1201, cumulating values 0.096 and 0.08. The three different algorithms of Alchemy 2 – MC-SAT inference (see [12]), exact and approximate lifted inference explained in [11] – give 0.135, 0 and 0.741, respectively. In the following tables we will refer to these three as *Alch i*, *Alch e* and *Alch a*. Removing the input clause 0.4::stress(bob) also removes the cumulation possibility and both CONFER and ProbLog2 give 0.096 as an answer.

Next, the well-known earthquake example. CONFER performs both cumulation and collecting negative evidence.

```
person(john). person(mary).
0.7::burglary. 0.2::earthquake.
0.9::alarm :- burglary, earthquake.
0.8::alarm :- burglary, \+earthquake.
0.1::alarm :- \+burglary, earthquake.
0.8::calls(X) :- alarm, person(X).
0.1::calls(X) :- \+alarm, person(X).
evidence(calls(john),true).
evidence(calls(mary),true).
query(burglary).
query(earthquake).
```
We will present the ProbLog2 and CONFER results with both the positive and negative evidence components (columns CONFER + and CONFER -) given by CONFER. Importantly, by default CONFER will try to find up to 10 different proofs: increasing or decreasing this limit has a noticeable effect on the results as well as on the running time.


Finally, we consider the famous penguin example from default logic. We formulate it using confidences instead of defaults, stating that penguins form a tiny subset of birds. The CONFER implementation collects both positive and negative evidence, but there are no cumulation possibilities.

```
1.0::bird(tweety). 1.0::penguin(pennie).
1.0:: bird(X) :- penguin(X).
0.001:: penguin(X) :- bird(X).
0.9:: flies(X) :- bird(X).
1.0:: not flies(X) :- penguin(X).
query(flies(_)).
```


#### **4.2 Performance**

We will investigate the performance of our CONFER implementation on the following nontrivial example FOL problems from the TPTP collection [40]. Due to restrictions in the language or the principles of the search algorithm, ProbLog2 cannot handle any of these examples even if they are converted to clauses in ProbLog syntax. Thus we will compare the performance of the CONFER system on several modifications of the problems against the conventional FOL prover gkc used as a base for building the CONFER system.

The results are given for the following problems, with the TPTP identifiers and ratings: rating 0 means all the provers tested by the TPTP maintainers find a proof, and rating 1 means no prover manages to find a proof. *Steamroller* (PUZ031+1, rating 0) is a puzzle without equality. *Dreadbury* (PUZ001+2.p, rating 0.23) is a puzzle also using equality. *Lukasiewicz* (LCL047-1.p, rating 0) is an example in logical calculi. Commonsense reasoning problems from CYC are taken from the largest consistent CYC version in TPTP: *CSR025+5, CSR035+5, CSR045+5, CSR055+5* (ratings 0.67, 0.83, 0.97, 0.87).

The CYC problems CSR025+5 ... CSR055+5 contain circa half a million formulae, but the proofs are relatively short. The first three problems are relatively small, but their proofs are significantly longer. The Steamroller, Dreadbury and CYC CSR035+5 problems have been augmented with a question asking for answer substitutions, while for the other CYC problems and the Lukasiewicz problem the conjectures do not contain an existential quantifier, thus we just try to prove them. For comparison purposes the CONFER proof searches are restricted to finding only the first answer (thus no cumulation is possible) and to not collecting negative evidence.

We consider both the versions of problems with all clauses assigned a confidence between 0.6 ... 0.99, cyclically with a step of 0.01 (column CONFER in the following table), and with all the confidences assigned 1.0 (column CONFER 1.0). It is important to note that the CONFER system uses conventional subsumption and simplification for clauses with confidence 1.0, i.e., in the "CONFER 1.0" column proof search reduces to ordinary resolution search. The gkc column gives the pure search time of the gkc prover used as a base for building the CONFER system, for the original TPTP versions (without a question asking for substitutions). As a special case, variations 0 ... 4 of the Lukasiewicz problem are formed by attaching confidences below 1 to respectively 1 ... 4 input clauses and letting the other confidences have value 1.0 (the Lukasiewicz problem consists of five clauses, one of these being the clause to be proved).
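As an illustration, the cyclic confidence assignment described above could be generated as follows; the helper name and the exact wrap-around behavior are our assumptions about the setup:

```python
def cyclic_confidences(n_clauses, lo=0.60, hi=0.99, step=0.01):
    """Assign clause k the confidence lo, lo + step, ..., hi, lo, ...
    cyclically (a sketch of the experimental setup described in the text)."""
    span = round((hi - lo) / step) + 1   # 40 distinct values for 0.60..0.99
    return [round(lo + (k % span) * step, 2) for k in range(n_clauses)]
```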

The columns CONFER ... "gkc pure" contain the pure proof search time in seconds, using negative ordered resolution for all the problems except CYC, and the set-of-support resolution for CYC. Pure search time does not include printing, parsing and clausifying the problem, or indexing the formed clauses. The final column "gkc full" gives the full wall clock time for gkc.


We can observe that the confidence and dependency collecting calculations, along with the restricted c-subsumption, do not have a noticeable effect on performance for most of these problems. However, adding confidences below 1 to the Lukasiewicz problem incurs a significant penalty, which – surprisingly – diminishes somewhat when all the clauses have such confidences. The confidences incur a noticeable penalty for CSR045+5, which has the longest proof among our CYC examples. Our hypothesis is that for these examples the c-subsumption, along with the restricted simplification, changes the direction of the search significantly.

#### **5 Summary and Future Work**

We have presented a novel framework CONFER along with the implementation for reasoning with approximate confidences for full, unrestricted first order logic. The presented examples demonstrate that the confidences found by our implementation are similar to the confidences found by the leading probabilistic Prolog and Markov logic implementations ProbLog2 [18] and Alchemy 2 [1]. CONFER is based on conventional first order theorem proving theory and algorithms not requiring saturation, differently from the systems using weighted ground saturation of FOL formulas like ProbLog2 and Alchemy 2. We have shown that this enables the CONFER implementation to efficiently solve large nontrivial FOL problems with attached confidences.

We plan to continue work on the CONFER implementation in several directions: finding and removing bugs, improving the functionality and devising search strategies specialized for the FOL formulas with associated confidences. We expect to integrate machine learning approaches, in particular using semantic similarities for reasoning with analogies and estimating the relevance of input clauses for proof search guidance. The goal of this work is creating a practically usable component for logic-based question answering from large commonsense knowledge bases.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Neural Precedence Recommender

Filip Bártek<sup>1,2</sup> and Martin Suda<sup>1</sup>

<sup>1</sup> Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Czech Republic
<sup>2</sup> Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic
{filip.bartek,martin.suda}@cvut.cz

Abstract. The state-of-the-art superposition-based theorem provers for first-order logic rely on simplification orderings on terms to constrain the applicability of inference rules, which in turn shapes the ensuing search space. The popular Knuth-Bendix simplification ordering is parameterized by *symbol precedence*—a permutation of the predicate and function symbols of the input problem's signature. Thus, the choice of precedence has an indirect yet often substantial impact on the amount of work required to complete a proof search successfully.

This paper describes and evaluates a symbol precedence recommender, a machine learning system that estimates the best possible precedence based on observations of prover performance on a set of problems and random precedences. Using the graph convolutional neural network technology, the system does not presuppose the problems to be related or share a common signature. When coupled with the theorem prover Vampire and evaluated on the TPTP problem library, the recommender is found to outperform a state-of-the-art heuristic by more than 4 % on unseen problems.

Keywords: saturation-based theorem proving · simplification ordering · symbol precedence · machine learning · graph convolutional network

# 1 Introduction

Modern saturation-based Automatic Theorem Provers (ATPs) such as E [34], SPASS [40], or Vampire [21] employ the superposition calculus [4,24] as their underlying inference system. Integrating the flavors of resolution [5], paramodulation [30], and the unfailing completion [3], superposition is a powerful calculus with native support for equational reasoning. The calculus is parameterized by a simplification ordering on terms and uses it to constrain the applicability of inferences, with a significant impact on performance.

Both main classes of simplification orderings used in practice, the Knuth-Bendix ordering [19] and the lexicographic path ordering [16], are specified with the help of a *symbol precedence*, an ordering on the signature symbols. While the superposition calculus is refutationally complete for any simplification ordering [4], the choice of the precedence has a significant impact on how long it takes to solve a given problem.

It is well known that giving the highest precedence to the predicate symbols introduced as sub-formula names during clausification [25] can immediately make the saturation produce the exponential set of clauses that the transformation is designed to avoid [29]. Also, certain orderings help to make the superposition a decision procedure on specific fragments of first-order logic (see, e.g., [11,14]). However, the precise way by which the choice of a precedence influences the follow-up proof search on a general problem is extremely hard to predict.

Several general-purpose precedence generating schemes are available to ATP users, such as the successful invfreq scheme in E [33], which orders the symbols by the number of occurrences in the input problem. However, experiments with random precedences indicate that the existing schemes often fail to come close to the optimum precedence [28], suggesting room for further improvements.

In this work, we propose a machine learning system that learns to predict for an ATP whether one precedence will lead to a faster proof search on a given problem than another. Given a previously unseen problem, it can then be asked to recommend the best possible precedence for an ATP to run with. Relying only on the logical structure of the problems, the system generalizes the knowledge about favorable precedences across problems with different signatures.

Our recommender uses a relational graph convolutional neural network [32] to represent the problem structure. It learns from the ATP performance on selected problems and pairs of randomly sampled precedences. This information is used to train a *symbol cost model*, which then realizes the recommendation by simply sorting the problem's symbols according to the obtained costs.

This work strictly improves on our previous experiments with linear regression models and simple hand-crafted symbol features [6] and is, to the best of our knowledge, the first method able to propose good symbol precedences automatically using a non-linear transformation of the input problem structure.

The rest of this paper is organized as follows. Section 2 exposes the basic terminology used throughout the remaining sections. Section 3 proposes a structure of the precedence recommender that can be trained on pairs of symbol precedences, as described in Sect. 4. Section 5 summarizes and discusses experiments performed using an implementation of the precedence recommender. Section 6 compares the system proposed in this work with notable related works. Section 7 concludes the investigation and outlines possible directions for future research.

# 2 Preliminaries

## 2.1 Saturation-Based Theorem Proving

A *first-order logic (FOL) problem* consists of a set of axiom formulas and a conjecture formula. In a *refutation-based automated theorem prover (ATP)*, proving that the axioms entail the conjecture is reduced to proving that the axioms together with the negated conjecture entail a *contradiction*. The most popular FOL ATPs, such as Vampire [21], E [34], or SPASS [40], start the proof search by converting the input FOL formulas to an equisatisfiable representation in *clause normal form (CNF)* [25,13]. We denote a problem in CNF as P = (Σ, *Cl*), where Σ is a list of all non-logical (predicate and function) *symbols* in the problem, called the *signature*, and *Cl* is the set of clauses of the problem (including the negated conjecture).

Given a problem P in CNF, a *saturation-based* ATP searches for a refutational proof by iteratively applying the *inference rules* from the given *calculus* to infer new clauses entailed by *Cl*. As soon as the empty clause, denoted by □, is inferred, the prover concludes that the premises entail the conjecture. The sequence of inferences leading from the input clauses *Cl* up to the discovered □ constitutes a proof. If the premises do not entail the conjecture, the proof search continues until the set of inferred clauses is saturated with respect to the inference rules. In the standard setting of time-restricted proof search, a time limit may end the process prematurely.

Since the space of derivable clauses is typically very large, the efficacy of the prover depends on the order in which the inferences are applied. The standard saturation-based ATPs order the inferences by maintaining two classes of inferred clauses: processed and unprocessed [34]. In each *iteration of the saturation loop*, one clause (the so-called *given clause*) is combined with all the processed clauses for inferences. The resulting new clauses are added to the unprocessed set, and the given clause is added to the processed set. Finishing the proof in a few iterations of the saturation loop is important because the number of inferred clauses typically grows exponentially during the proof search.
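The given-clause loop described above can be sketched abstractly; the toy version below (with a naive propositional resolution rule and a crude size-based clause selection, both our simplifications) illustrates the bookkeeping, not any particular prover:

```python
def resolve(c1, c2):
    """All binary resolvents of two clauses. Clauses are frozensets of
    integer literals; -n denotes the negation of n."""
    return {frozenset((c1 - {l}) | (c2 - {-l})) for l in c1 if -l in c2}

def saturate(input_clauses, select, max_iterations=10_000):
    """Abstract given-clause saturation loop (a sketch).
    `select` picks the next given clause from the unprocessed set."""
    processed, unprocessed = set(), set(input_clauses)
    for _ in range(max_iterations):
        if not unprocessed:
            return "saturated"            # no refutation was found
        given = select(unprocessed)
        unprocessed.discard(given)
        if given == frozenset():          # empty clause: contradiction derived
            return "proof found"
        new = set()
        for clause in processed:          # combine given with processed clauses
            new |= resolve(given, clause)
        processed.add(given)
        unprocessed |= new - processed
    return "timeout"

# pick the syntactically smallest clause first (a crude clause selection)
select = lambda clauses: min(clauses, key=lambda c: (len(c), sorted(c)))
```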

#### 2.2 Superposition Calculus

The *superposition calculus* is of particular interest because it is used in the most successful contemporary FOL ATPs. A *simplification ordering on terms* [4] constrains the inferences of the superposition calculus.

The simplification ordering on terms influences the superposition calculus in two ways. First, the inferences on each clause are limited to the selected literals. In each clause, either a negative literal or all maximal literals are selected. The maximality is evaluated according to the simplification ordering. Second, the simplification ordering orients some of the equalities to prevent superposition and equality factoring from inferring redundant complex conclusions. In each of these two roles, the simplification ordering may impact the direction and, in effect, the length of the proof search.

The *Knuth-Bendix ordering (KBO)* [19], a commonly used simplification ordering scheme, is parameterized by symbol weights and a *symbol precedence*, a permutation<sup>3</sup> of the non-logical symbols of the input problem. In this work, we focus on the task of finding a symbol precedence which leads to a good performance of an ATP when plugged into the Knuth-Bendix ordering (KBO), leaving all the symbol weights at the default value 1 as set by the ATP Vampire.
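To make the role of the precedence concrete, here is a sketch of KBO restricted to ground terms with all symbol weights 1 (the general definition also handles variables; the representation of terms as nested tuples is our choice):

```python
def kbo_greater(s, t, prec):
    """Ground-term KBO with unit symbol weights (a simplified sketch).
    Terms are (symbol, args) tuples; prec maps a symbol to its precedence
    rank, where a higher rank means a greater symbol."""
    def weight(term):
        return 1 + sum(weight(a) for a in term[1])   # all weights are 1
    ws, wt = weight(s), weight(t)
    if ws != wt:
        return ws > wt                    # heavier term is greater
    if s[0] != t[0]:
        return prec[s[0]] > prec[t[0]]    # equal weight: compare head symbols
    for sa, ta in zip(s[1], t[1]):        # same symbol: compare args left to right
        if sa != ta:
            return kbo_greater(sa, ta, prec)
    return False                          # equal terms
```

Different precedences orient different equalities, which is precisely the lever the recommender tunes.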

#### 2.3 Neural Networks

A *feedforward artificial neural network* [12] is a directed acyclic graph of *modules*. Each module is an operation that consumes a numeric (input) *vector* and outputs a numeric vector. Each of the components of the output vector is called a *unit* of the module. The output of each module is differentiable with respect to the input almost everywhere.

<sup>3</sup> The definition of KBO does not require the precedence to be total. However, for use in ATPs, the more symbols and thus also terms we can compare, the better.

The standard modules include the *fully connected layer*, which performs an affine transformation, and non-linear *activation functions* such as the *Rectified Linear Unit (ReLU)* or the *sigmoid*.<sup>4</sup> A fully connected layer with a single unit is called the *linear unit*.

Some of the modules are parameterized by numeric *parameters*. For example, the fully connected layer that transforms the input x by the affine transformation W x+b is parameterized by the weight matrix W and the bias vector b. If the output of a module is differentiable with respect to a parameter, that parameter is considered *trainable*.

In a typical scenario, the neural network is trained by *gradient descent* on a *training set* of *examples*. In such a setting, the network outputs a single numeric value called *loss* when evaluated on a *batch* of examples. The loss of a batch is typically computed as a weighted sum of the losses of the individual examples. Since each of the modules is differentiable with respect to its input and trainable parameters, the gradient of the loss with respect to all trainable parameters of the neural network can be computed using the *back-propagation* algorithm [12]. The trainable parameters are then updated by taking a small step against the gradient—in the direction that is expected to reduce the loss. An *epoch* is a sequence of iterations that updates the trainable parameters using each example in the training set exactly once.

A *graph convolutional network (GCN)* is a special case of feedforward neural network. The modules of a GCN transform messages that are passed along the edges of a graph encoded in the input example. A particular architecture of a GCN used prominently in this work is discussed in Sect. 3.2.

# 3 Architecture

A *symbol precedence recommender* is a system that takes a CNF problem P = (Σ, *Cl*) as the input, and produces a precedence π<sup>∗</sup> over the symbols Σ as the output. For the recommender to be useful, it should produce a precedence that likely leads to a quick search for a proof. In this work, we use the number of iterations of the saturation loop as a metric describing the effort required to find a proof.

The recommender described in this section first uses a neural network to compute a cost value for each symbol of the input problem, and then orders the symbols by their costs in a non-increasing order. In this manner, the task of finding good precedences is reduced to the task of training a good symbol cost function, as discussed in Sect. 4.

The recommender consists of modules that perform specific sub-tasks, each of which is described in detail in one of the following sections (see also Fig. 1).

#### 3.1 Graph Constructor: From CNF to Graphs

As the first step of the recommender processing pipeline, the input problem is converted from a CNF representation to a *heterogeneous (directed) graph* [41]. Each of the nodes of the graph is labeled with a node type, and each edge is labeled with an edge type,

<sup>4</sup> These are, respectively, f(x) = max{0, x} and g(x) = 1/(1 + e<sup>−x</sup>).

Fig. 1. Recommender architecture overview. When recommending a precedence, the input is problem P and the output is precedence π<sup>∗</sup>. When training, the input is problem P and precedences π and ρ, and the output is the loss value. The trainable modules and the edges along which the loss gradient is propagated are emphasized by bold lines.

defining the heterogeneous nature of the graph. Each node corresponds to one of the elements that constitute the CNF formula, such as a clause, an atom, or a predicate symbol. Each such category of elements corresponds to one node type. The edges represent the (oriented) relations between the elements, for example, the incidence relation between a clause and one of its (literals') atoms, or the relation between an atom and its predicate symbol. R denotes the set of all relations in the graph. Figure 2 shows the types of nodes and edges used in our graph representation. Figure 3 shows an example of a graph representation of a simple problem.

In particular, the graph representation exhibits the following properties:


#### 3.2 GCN: From Graphs to Symbol Embeddings

For each symbol in the input problem P, we seek to find a vector representation, i.e., an *embedding*, that captures the symbol's properties that are relevant for correctly ranking the symbol in the symbol precedences over P.

Fig. 2. CNF graph schema

The symbol embeddings are output by a *relational graph convolutional network (R-GCN)* [32], which is a stack of *graph convolutional layers*. Each layer consists of a collection of differentiable modules—one module per edge type. The computation of the GCN starts with assigning each node an initial embedding and then iteratively updates the embeddings by passing them through the convolutional layers.

The initial embedding h<sub>a</sub><sup>(0)</sup> of a node a is a concatenation of two vectors: a *feature vector* specific for that node (typically empty) and a trainable vector shared by all nodes of the same type. In our particular implementation, feature vectors are used in nodes that correspond to clauses and symbols. Each clause node has a feature vector with a one-hot encoding of the role of the clause, which can be either axiom, assumption, or negated conjecture [38,36]. Each symbol node has a feature vector with two bits of data: whether the symbol was introduced into the problem during preprocessing (most notably during clausification), and whether the symbol appears in a conjecture clause.

One pass through the convolutional layer updates the node embeddings by passing a message along each of the edges. For an edge of type r ∈ R going from source node s to destination node d at layer l, the message is composed by converting the embedding of the source node h<sub>s</sub><sup>(l)</sup> using the module associated with the edge type r. In the simple case that the module is a fully connected layer with weight matrix W<sub>r</sub><sup>(l)</sup> and bias vector b<sub>r</sub><sup>(l)</sup>, the message is W<sub>r</sub><sup>(l)</sup> h<sub>s</sub><sup>(l)</sup> + b<sub>r</sub><sup>(l)</sup>. Each message is then divided by the normalization constant c<sub>s,d</sub> = √|N<sub>s</sub><sup>r</sup>| √|N<sub>d</sub><sup>r</sup>| [18], where N<sub>a</sub><sup>r</sup> is the set of neighbors of node a under the relation r.

Once all messages are computed, they are aggregated at the destination nodes to form new node embeddings. Each node d aggregates all the incoming messages of a given edge type r by summation, then passes the sum through an activation function σ such as the ReLU, and finally aggregates the messages across the edge types by summation, yielding the new embedding h<sub>d</sub><sup>(l+1)</sup>.

Fig. 3. Graph representation of the CNF formula a = b ∧ f(a, b) = f(b, b)

The following formula captures the complete update of the embedding of node d by layer l:

$$h\_d^{(l+1)} = \sum\_{r \in \mathcal{R}} \sigma \left( \sum\_{s \in \mathcal{N}\_d^r} \frac{1}{c\_{s,d}} (W\_r^{(l)} h\_s^{(l)} + b\_r^{(l)}) \right)$$

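A direct (and deliberately unoptimized) reading of this update rule, using plain lists as vectors and dictionaries as containers (our choices, not the authors' implementation), might look like:

```python
def matvec(W, x):
    """Matrix-vector product over plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def rgcn_layer(h, edges, W, b):
    """One R-GCN layer (a sketch of the update formula above).
    h maps node -> embedding (list of floats); edges maps relation type
    r -> list of (source, destination) pairs; W[r] and b[r] are the
    per-relation weight matrix and bias vector."""
    relu = lambda v: [max(0.0, x) for x in v]
    dim = len(next(iter(b.values())))
    out = {d: [0.0] * dim for d in h}
    for r, pairs in edges.items():
        # per-relation neighbor counts, used in the normalization c_{s,d}
        deg = {n: sum(n in e for e in pairs) for n in h}
        acc = {d: [0.0] * dim for d in h}
        for s, d in pairs:
            c = (deg[s] ** 0.5) * (deg[d] ** 0.5)
            msg = vadd(matvec(W[r], h[s]), b[r])
            acc[d] = vadd(acc[d], [m / c for m in msg])
        for d in h:                       # sum over relations after activation
            out[d] = vadd(out[d], relu(acc[d]))
    return out
```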
#### 3.3 Output Layer: From Symbol Embeddings to Symbol Costs

The symbol cost of each symbol is computed by passing the symbol's embedding through a linear output unit, which is an affine transformation with no activation function.

It is possible to use a more complex output layer in place of the linear unit, e.g., a feedforward network with one or more hidden layers. Our experiments showed no significant improvement when a hidden layer was added, likely because the underlying GCN learns a sufficiently complex transformation.

Let θ denote the vector of all parameters of the whole neural network consisting of the GCN and the output unit. Given an input problem P with signature Σ = (s<sub>1</sub>, ..., s<sub>n</sub>), we denote the cost of symbol s<sub>i</sub> predicted by the network as c(i, P; θ). In the rest of this text, we refer to the predicted cost of s<sub>i</sub> simply as c(i), because the problem P and the parameters θ are fixed in each respective context.

#### 3.4 Sort: From Symbol Costs to Precedence

The symbol precedence heuristics commonly used in the ATPs sort the symbols by some numeric syntactic property that is inexpensive to compute, such as the number of occurrences in the input problem, or the symbol arity. In our precedence recommender, we sort the symbols by their costs c produced by the neural network described in Sects. 3.2 and 3.3. An advantage of this scheme is that sorting is a fast operation.

Moreover, as we show in Sect. 4, it is possible to train the underlying symbol costs by gradient descent.

#### 4 Training Procedure

In Sect. 3 we described the structure of a recommender system that generates a symbol precedence for an arbitrary input problem. The efficacy of the recommender depends on the quality of the underlying symbol cost function c. In theory, the symbol cost function can assign the costs so that sorting the symbols by their costs yields an optimum precedence. This is because, at least in principle, all the information necessary to determine the optimum precedence is present in the graph representation of the input problem thanks to the lossless property of the graph encoding. Our approach to defining an appropriate symbol cost function is based on statistical learning from executions of an ATP on a set of problems with random precedences.

To train a useful symbol cost function c, we define a precedence cost function C using the symbol cost function c in a manner that ensures that minimizing C corresponds to sorting the symbols by c. Finding a precedence that minimizes C can then be done efficiently and precisely. We proceed to train C on the proxy task of ranking the precedences.

#### 4.1 Precedence Cost

We extend the notion of cost from symbols to precedences by taking the sum of the symbol costs weighted by their positions in the given precedence π:

$$C(\pi) = Z\_n \sum\_{i=1}^n i \cdot c(\pi(i))$$

Z_n = 2/(n(n+1)) is a normalization factor that ensures the commensurability of precedence costs across signature sizes. More precisely, normalizing by Z_n makes the expected value of the precedence cost on a given problem independent of the problem's signature size n, provided the expected symbol cost E_i[c(i)] does not depend on n:

$$\begin{aligned} \mathbb{E}\_{\pi}[C(\pi)] &= \mathbb{E}\_{\pi} \left[ Z\_n \sum\_{i=1}^n i \cdot c(\pi(i)) \right] = Z\_n \sum\_{i=1}^n i \cdot \mathbb{E}\_{\pi}[c(\pi(i))] \\ &= Z\_n \left( \sum\_{i=1}^n i \right) \mathbb{E}\_i[c(i)] = \frac{2}{n(n+1)} \frac{n(n+1)}{2} \mathbb{E}\_i[c(i)] = \mathbb{E}\_i[c(i)] \end{aligned}$$
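The derivation can be spot-checked numerically on hypothetical costs: averaging C over all 3! precedences recovers the mean symbol cost:

```python
from itertools import permutations

def precedence_cost(costs, pi):
    # C(pi) = Z_n * sum_i i * c(pi(i)), with positions i counted from 1
    n = len(costs)
    return 2.0 / (n * (n + 1)) * sum((i + 1) * costs[p] for i, p in enumerate(pi))

costs = [0.2, 0.5, 0.9]                      # hypothetical symbol costs
perms = list(permutations(range(3)))
avg = sum(precedence_cost(costs, p) for p in perms) / len(perms)
# avg equals the mean symbol cost, independently of n
```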

When C is defined in this way, the precedence produced by the recommender (see Sect. 3.4) minimizes C.

Lemma 1. *The precedence cost* C *is minimized by any precedence that sorts the symbols by their costs in non-increasing order:*

$$\underset{\rho}{\operatorname{argmin}} \, C(\rho) = \operatorname{argsort}^{-}(c(1), \dots, c(n))$$

*where* argmin_ρ C(ρ) *is the set of all precedences that minimize precedence cost* C *for a given symbol cost* c*, and* argsort^−(x) *is the set of all permutations* π *that sort vector* x *in non-increasing order* (x_π(1) ≥ x_π(2) ≥ ... ≥ x_π(n)).

*Proof.* We prove the direction "argmin_ρ C(ρ) ⊆ argsort^−(c(1),...,c(n))" by contradiction. Let π minimize C and suppose π does not sort the costs in non-increasing order. Then there exist k < l such that c(π(k)) < c(π(l)). Let π̄ be the precedence obtained from π by swapping the elements at positions k and l. Then we obtain

$$\begin{split} \frac{C(\bar{\pi}) - C(\pi)}{Z\_n} &= kc(\bar{\pi}(k)) + lc(\bar{\pi}(l)) - kc(\pi(k)) - lc(\pi(l)) \\ &= kc(\pi(l)) + lc(\pi(k)) - kc(\pi(k)) - lc(\pi(l)) \\ &= k(c(\pi(l)) - c(\pi(k))) - l(c(\pi(l)) - c(\pi(k))) \\ &= (k - l)(c(\pi(l)) - c(\pi(k))) \\ &< 0 \end{split}$$

The final inequality is due to k − l < 0 and c(π(l)) − c(π(k)) > 0. Clearly, Z_n > 0 for any n ≥ 1. Thus, C(π̄) < C(π), which contradicts the assumption that π minimizes C.

To prove the other direction of the equality, first observe that all precedences π that sort the symbol costs in non-increasing order necessarily have the same precedence cost C(π). Since ∅ ≠ argmin_ρ C(ρ) ⊆ argsort^−(c(1),...,c(n)), each of the precedences in argsort^−(c(1),...,c(n)) has the cost min_ρ C(ρ). It follows that argsort^−(c(1),...,c(n)) ⊆ argmin_ρ C(ρ).
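Lemma 1 can be verified by brute force on a small hypothetical cost vector (a tie is included so that both inclusions are exercised):

```python
from itertools import permutations

def precedence_cost(costs, pi):
    # C(pi) = Z_n * sum_i i * c(pi(i)) from Sect. 4.1, positions 1-based
    n = len(costs)
    return 2.0 / (n * (n + 1)) * sum((i + 1) * costs[p] for i, p in enumerate(pi))

costs = [0.3, 0.7, 0.1, 0.7]                 # hypothetical costs, with a tie
perms = list(permutations(range(4)))
best = min(precedence_cost(costs, p) for p in perms)
argmin_set = {p for p in perms if precedence_cost(costs, p) - best < 1e-12}
argsort_dec = {p for p in perms
               if all(costs[p[i]] >= costs[p[i + 1]] for i in range(3))}
assert argmin_set == argsort_dec             # both inclusions of Lemma 1
```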

#### 4.2 Learning to Rank Precedences

Our ultimate goal is to train the precedence cost function C so that it is minimized by the best precedence, measuring the quality of a precedence by the number of iterations of the saturation loop taken to solve the problem.

Approaching this task directly, as a regression problem, runs into the difficulty of establishing sensible target cost values for the precedences in the training dataset, especially when a wide variety of input problems is covered. Approaching the task as a binary classification of precedences seems possible, but it is not clear which precedences should be a priori labeled as positive and which as negative, to give a guarantee that a precedence minimizing the precedence cost (i.e. the one obtained by sorting) would be among the best in any good sense.

We cast the task as an instance of a score-based ranking problem [23,7] by training a classifier to decide which of a *pair* of precedences is better based on their costs. We train the classifier in a way that ensures that better precedences are assigned lower costs. The motivation for learning to order pairs of precedences is that it allows learning on easy problems, and that it may allow the system to generalize to precedences that are better than any of those seen during training.

Training Data. Each training example has the form (P, π, ρ), where P = (Σ, *Cl*) is a problem and π, ρ are precedences over Σ such that the prover using π solves P in fewer iterations of the saturation loop than with ρ, denoted π ≺_P ρ.

Loss Function. Let (P, π, ρ) be a training example (π ≺_P ρ). The precedence cost classifies this example correctly if C(π) < C(ρ), or equivalently if S(π, ρ) = C(ρ) − C(π) > 0. We approach this problem as an instance of binary classification with the logistic loss [23], a loss function routinely used in classification tasks in machine learning:

$$\begin{aligned} \ell(P, \pi, \rho) &= -\log \text{sigmoid}\, S(\pi, \rho) = -\log \text{sigmoid}(C(\rho) - C(\pi)) \\ &= -\log \text{sigmoid}\, Z\_n \sum\_{i=1}^n i (c(\rho(i)) - c(\pi(i))) \end{aligned}$$

Note that the classifier cannot simply train S to output a positive number on all pairs of precedences because S is defined as a difference of two precedence costs. Intuitively, by training on the example (P, π, ρ) we are pushing C(π) down and C(ρ) up.

The loss function is clearly differentiable with respect to the symbol costs, and the symbol cost function c is differentiable with respect to its trainable parameters. This enables the use of gradient descent to find the values of the parameters of c that locally minimize the loss value.
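A plain-Python sketch of the loss on a single training example, using the precedence cost from Sect. 4.1 (illustrative only; in the actual system the symbol costs come from the trained GCN):

```python
import math

def pair_loss(costs, pi, rho):
    """Logistic loss -log sigmoid(C(rho) - C(pi)) for an example (P, pi, rho)
    with pi the better precedence; low loss means C ranks pi below rho."""
    n = len(costs)
    z = 2.0 / (n * (n + 1))
    s = z * sum((i + 1) * (costs[rho[i]] - costs[pi[i]]) for i in range(n))
    return -math.log(1.0 / (1.0 + math.exp(-s)))
```

A correctly ranked pair yields a loss below log 2 (the loss at S = 0), a misranked pair a loss above it.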

Figure 1 shows how the loss function is plugged into the recommender for training.

# 5 Experimental Evaluation

To demonstrate the capacity of the trainable precedence recommender described in Sects. 3 and 4, we performed a series of experiments. In this section, we describe the design and configuration of the experiments, and then compare the performance of several trained models to a baseline heuristic.

The scripts that were used to generate the training data and to train and evaluate the recommender are available online.<sup>5</sup>

#### 5.1 Environment

System. All experiments were run on a computer with the CPU Intel Xeon Gold 6140 (72 cores @ 2.30 GHz) and 383 GiB RAM.

<sup>5</sup> https://github.com/filipbartek/vampire-ml/tree/cade28

Solver. The empirical evaluation was performed using a modified version of the ATP Vampire 4.3.0 [21]. The prover was used to generate the training data and to evaluate the trained precedence recommender. To generate the training data, Vampire was modified to output CNF representations of the problems and annotated problem signatures in a machine-readable format. For the evaluation of the precedences generated by the recommender, Vampire was modified to allow the user to supply explicit predicate and function symbol precedences for the proof search (normally, the user only picks a precedence generation heuristic). The modified version of Vampire is available online.6

We run Vampire with a fixed strategy<sup>7</sup> and a time limit of 10 seconds. To increase the potential impact of predicate precedences, we used a simple transfinite Knuth–Bendix ordering (TKBO) [22,20] that compares atoms according to the predicate precedence first, using the regular KBO to break ties between atoms and to compare terms (using the Vampire option --literal_comparison_mode predicate).

#### 5.2 Dataset Preparation

The training data consists of examples of the form (P, π, ρ), where P is a CNF problem and π, ρ are precedences of symbols of problem P such that out of the two precedences, π yields a proof in fewer iterations of the saturation loop (see Sect. 2.1).

Since the TKBO never compares a predicate symbol with a function symbol, two separate precedences can be considered for each problem: a predicate precedence and a function precedence. We trained a predicate precedence recommender separately from a function precedence recommender to simplify the training process and to isolate the effects of the predicate and function precedences. This section describes how the training data for the case of training a *predicate* precedence recommender was generated. Data for training the function precedence recommender was generated analogously.

Base Problem Set. The input problems were assumed to be specified in the CNF or the first-order form (FOF) fragment of the TPTP language [36]. FOF problems were first converted into equisatisfiable CNF problems by Vampire.

We used the problem library TPTP v7.4.0 [36] as the source of problems for training and evaluation of the recommender. We denote the set of all problems available for training and evaluation as P0 (|P0| = 17 053).

Node Feature Extraction. In addition to the signature and the structure of the problem, some metadata was extracted from the input problem to allow training a more efficient recommender. First, each clause was annotated with its role in the problem, which could be either axiom, assumption, or negated conjecture. Second, each symbol was annotated with two bits of data: whether the symbol was introduced into the problem during preprocessing, and whether the symbol appeared in a conjecture clause. This metadata was used to construct the initial embeddings of the respective nodes in the graph representation of the problem (see Sect. 3.2).

<sup>6</sup> https://github.com/filipbartek/vampire/tree/cade28

<sup>7</sup> Saturation algorithm: DISCOUNT, age to weight ratio: 1:10, AVATAR [39]: disabled, literal comparison mode: predicate; all other options left at their default values.

Examples Generation. The examples were generated by an iterative sampling of P0. In each iteration, a problem P ∈ P0 was chosen and Vampire was executed twice on P with two (uniformly) random predicate precedences and one common random function precedence. The "background" random function precedence served as additional noise (in addition to the variability contained in TPTP) and made sure that the predicate precedence recommender could not exploit any regularity arising from fixed function precedences in the training data.

The two executions were compared in terms of performance: the predicate precedence π was recognized as better than the predicate precedence ρ, denoted π ≺_P ρ, if the proof search finished successfully with π and the number of iterations of the saturation loop with π was smaller than with ρ. If one of the two precedences was recognized as better, the example (P, π, ρ) was produced, with π the better precedence and ρ the other one. Otherwise, for example if the proof search timed out with both precedences, another problem was sampled.
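The comparison can be sketched as follows, where `run` is a hypothetical stand-in for executing Vampire and reporting success and the saturation-loop iteration count:

```python
def produce_example(P, prec1, prec2, run):
    """Emit a training example (P, better, worse) if one precedence is
    recognized as better, else None. `run(P, prec)` is a hypothetical
    prover call returning (solved, iterations)."""
    ok1, it1 = run(P, prec1)
    ok2, it2 = run(P, prec2)
    if ok1 and it1 < it2:
        return (P, prec1, prec2)
    if ok2 and it2 < it1:
        return (P, prec2, prec1)
    return None  # e.g. both runs timed out: go back and sample another problem
```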

To ensure the efficiency of the sampling, we interpreted the process as an instance of the Bernoulli multi-armed bandit problem [37], with the reward of a trial being 1 in case an example is produced, and 0 otherwise.

We employed adaptive sampling to balance exploring problems that had been tried relatively scarcely and exploiting problems that had yielded examples relatively often. For each problem P ∈ P0, the generator kept track of the number of times the problem had been tried, n_P, and the number of examples generated from that problem, s_P. The ratio s_P/n_P corresponded to the average reward of problem P observed so far. The problems were sampled using the allocation strategy UCB1 [1] with a parallelizing relaxation.

First, the values of n_P and s_P for each problem P were bootstrapped by sampling the problem a number of times equal to a lower bound on the final value of n_P (at least 1).<sup>8</sup> In each subsequent iteration, the generator sampled the problem P that maximized s_P/n_P + √(2 ln n / n_P), where n = Σ_{P′∈P0} n_P′ was the total number of tries over all problems. The parallelizing relaxation means that the s_P values were only updated once in 1000 iterations, allowing up to 2000 parallel solver executions.
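A minimal sketch of the UCB1 choice (ignoring the bootstrap and the parallelizing relaxation):

```python
import math

def ucb1_pick(tries, successes):
    """Pick the next problem by UCB1: maximize s_P/n_P + sqrt(2 ln n / n_P),
    where n is the total number of tries over all problems."""
    n = sum(tries.values())
    def score(P):
        return successes[P] / tries[P] + math.sqrt(2 * math.log(n) / tries[P])
    return max(tries, key=score)
```

With equal average rewards, the exploration bonus favors the scarcely tried problem.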

The sampling continued until 1 000 000 examples were generated when training a predicate precedence recommender, or 800 000 examples in the case of a function precedence recommender. For example, while generating 1 000 000 examples for the predicate precedence dataset, 5349 out of the 17 053 problems yielded at least one example, while the least explored problem was tried 19 times, and the most exploited problem 504 times.

Validation Split. The 17 053 problems in P0 were first split roughly in half to form the training set and the validation set. Next, both the training and validation sets were restricted to problems whose graph representation consisted of at most 100 000 nodes to limit the memory requirements of the training. Approximately 90 % of the problems fit into this limit, and there were 7648 problems in the resulting validation set Pval. The training

<sup>8</sup> The number of tries each problem was bootstrapped with is n_0 = (2 log N) / (1 + √(2 (log N) |P0| / N))², where N is the final number of examples to be generated. For example, if N = 1 000 000 and |P0| = 17 053, then n_0 = 10.

set Ptrain was further restricted to problems that correspond to at least one training example, resulting in 2571 problems when training a predicate precedence recommender, and 1953 problems when training a function precedence recommender.

#### 5.3 Hyperparameters

We used a GCN described in Sect. 3.2 with depth 4, message size 16, ReLU activation function, skip connections [41], and layer normalization [2]. We tuned the hyperparameters by a small manual exploration.

#### 5.4 Training Procedure

A symbol cost model was trained by gradient descent on the precedence ranking task (see Sect. 4.2) using the examples generated from Ptrain. To avoid redundant computations, all examples generated from any given problem were processed in the same training batch. Thus, each training batch contained up to 128 problems and all examples generated from these problems. The symbol cost model was trained using the Adam optimizer [17]. The learning rate started at 1.28 × 10^−3 and was halved each time the loss on Ptrain stagnated for 10 consecutive epochs.

The examples were weighted. Each of the examples of problem P contributed to the training with the weight 1/s_P, where s_P was the number of examples of problem P in the training set. This ensured that each problem contributed to the training to the same degree irrespective of the relative number of examples.
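The weighting can be sketched as follows (hypothetical helper):

```python
from collections import Counter

def example_weights(problems):
    """Weight each example by 1/s_P, where s_P is the number of training
    examples coming from the same problem P."""
    counts = Counter(problems)
    return [1.0 / counts[P] for P in problems]
```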

We continued the training until the validation accuracy stopped increasing for 100 consecutive epochs.

#### 5.5 Final Evaluation

After the training finished, we performed a final evaluation of the most promising intermediate trained model on the whole Pval. The model that manifested the best solver performance on a sample of 1000 validation problems was taken as the most promising.

#### 5.6 Results

A predicate precedence recommender was trained on approximately 500 000 examples, and a function precedence recommender was trained on approximately 400 000 examples. For each problem P ∈ Pval, a predicate and a function precedence were generated by the respective trained recommenders, and Vampire was run using these precedences with a wall clock time limit of 10 seconds. The results are averaged over 5 runs to reduce the effect of noise due to the wall clock time limit. As a baseline, the performance of Vampire with the frequency precedence heuristic<sup>9</sup> was evaluated with the same time limit. For comparison, the two trained recommenders were also evaluated separately, with the predicate precedence recommender using the frequency heuristic to generate the function precedences, and vice versa.

<sup>9</sup> This is Vampire's analogue of the invfreq scheme in E [33].

To generate a precedence for a problem, the recommender first converts the problem to a machine-friendly CNF format, then converts the CNF to a graph, then predicts symbol costs using the GCN model and finally orders the symbols by their costs to produce the precedence. To simplify the experiment, the time limit of 10 seconds was only imposed on the Vampire run, excluding the time taken by the recommender to generate the precedence. When run with 2 threads, the preprocessing of a single problem took at most 1.26 seconds for 80 % of the problems by extrapolation from a sample of 1000 problems.<sup>10</sup> Table 1 shows the results of the final evaluation.

Table 1. Results of the evaluation of symbol precedence heuristics based on various symbol cost models on Pval (|Pval| = 7648). Means and standard deviations over 5 runs are reported. The GCN models were trained according to the description in Sects. 3 to 5. The model Simple is the final linear model from our previous work [6]. The models that used machine learning only for the predicate precedence used the frequency heuristic for the function precedence, and vice versa. The frequency model uses the standard frequency heuristic for both predicate and function precedence.


The results show that the GCN-based model outperformed the frequency heuristic by a significant margin. Since the predicate precedence recommender was trained with randomly distributed function precedences, it was expected to perform well irrespective of the function precedence heuristic it is combined with, and conversely. Combining the trained recommenders for predicate and function precedences manifested better performance than any of the two in combination with the standard frequency heuristic, outperforming the frequency heuristic by approximately 4.8 %.

We have confirmed our earlier conjecture [6] that using a graph neural network (GNN) may outperform the "simple" linear predicate precedence heuristic trained in [6].<sup>11</sup>

# 6 Related Work

Our previous text [6] marked the initial investigation of applying techniques of machine learning to generating good symbol precedences. The neural recommender presented here uses a GNN to model symbol costs, while [6] used a linear combination of symbol features readily available in the ATP Vampire. The GNN-based approach yields more performant precedences at the cost of longer training and preprocessing time.

<sup>10</sup> The remaining 20 % of the problems either finished preprocessing within 5 seconds, or were omitted from preprocessing due to exceeding the node count limit.

<sup>11</sup> The measurements presented in Table 1 are not directly comparable with those reported in [6] due to differences in the validation problem sets and the computation environments.

In [26], [15] and [27], the authors propose similar GNN architectures to solve tasks on FOL problems. They use the GNNs to solve classification tasks such as premise selection. While our system is trained on a proxy classification task, the main task it is evaluated on is the generation of useful precedences.

The problem of learning to rank objects represented by scores trainable by gradient descent was explored in [7]. Our work can be seen to apply the approach of [7] to rank permutations represented by weighted sums of symbol costs.

## 7 Conclusion and Future Work

We have described a system that extracts useful symbol precedences from the graph representations of CNF problems. Comparison with a conventional symbol precedence heuristic shows that using a GCN to consider the whole structure of the input problem is beneficial.

A manual analysis of the trained recommender could produce new insights into how the choice of the symbol precedence influences the proof search, which could in turn help design new efficient precedence generating schemes. Indeed, a trained cost model summarizes the observed behaviors of an ATP with random precedences and is able to discover patterns in them (as we know implicitly from its accuracy) despite their seemingly chaotic behavior as perceived by a human observer. The challenge is to extract these patterns in a human-understandable form.

In addition to the symbol precedence, KBO is determined by symbol *weights*. In this work, we keep the symbol weights fixed to the value 1. Learning to recommend symbol weights in addition to the precedences represents an interesting avenue for future research.

The same applies to the idea of learning to recommend both the predicate and function precedences using a single GCN. The joint learning, although more complex to design, could additionally discover interdependencies between the effects of function precedence and predicate precedence on the proof search, while the current setup implicitly assumes that the effects are independent. Finally, a higher training data efficiency could be achieved by considering all pairs of measured executions on a problem in one training batch.

## Acknowledgments

This work was generously supported by the Czech Science Foundation project no. 20-06390Y (JUNIOR grant), the project RICAIP no. 857306 under the EU-H2020 programme, and the Grant Agency of the Czech Technical University in Prague, grant no. SGS20/215/OHK3/3T/37.

#### References

1. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256 (May 2002). https://doi.org/10.1023/A:1013689704352



# **Improving ENIGMA-style Clause Selection while Learning From History**

Martin Suda

Czech Technical University in Prague, Prague, Czech Republic martin.suda@cvut.cz

**Abstract.** We re-examine the topic of machine-learned clause selection guidance in saturation-based theorem provers. The central idea, recently popularized by the ENIGMA system, is to learn a classifier for recognizing clauses that appeared in previously discovered proofs. In subsequent runs, clauses classified positively are prioritized for selection. We propose several improvements to this approach and experimentally confirm their viability. For the demonstration, we use a recursive neural network to classify clauses based on their derivation history and the presence or absence of automatically supplied theory axioms therein. The automatic theorem prover Vampire guided by the network achieves a 41 % improvement on a relevant subset of SMT-LIB in a real time evaluation.

**Keywords:** Saturation-based theorem proving · Clause Selection · Machine Learning · Recursive Neural Networks.

#### **1 Introduction**

The idea to improve the performance of saturation-based automatic theorem provers (ATPs) with the help of machine learning (ML), while going back at least to the early work of Schulz [8, 30], has recently been enjoying a renewed interest. Most notable is the ENIGMA system [16,17] extending the ATP E [31] by machine learned clause selection guidance. The architecture trains a binary classifier for recognizing as positive those clauses that appeared in previously discovered proofs and as negative the remaining selected ones. In subsequent runs, clauses classified positively are prioritized for selection.

A system such as ENIGMA needs to carefully balance the expressive power of the used ML model with the time it takes to evaluate its advice. For example, Loos et al. [22], who were the first to integrate state-of-the-art neural networks with E, discovered their models to be too slow to simply replace the traditional clause selection mechanism. In the meantime, the data-hungry deep learning approaches motivate researchers to augment training data with artificially crafted theorems [1]. Yet another interesting aspect is what features we allow the model to learn from. One could speculate that the recent success of ENIGMA on the Mizar dataset [7, 18] can at least partially be explained by the involved problems sharing a common source and encoding. It is still open whether some new form of general "theorem proving knowledge" could be learned to improve the performance of an ATP across, e.g., the very diverse TPTP library.

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 543–561, 2021. https://doi.org/10.1007/978-3-030-79876-5_31

In this paper, we propose several improvements to ENIGMA-style clause selection guidance and experimentally test their viability in a novel setting:


To test these ideas, we designed a recursive neural network to classify clauses based solely on their derivation history and the presence or absence of automatically supplied theory axioms therein. This allows us to test here, as a byproduct of the conducted experiments, whether the human-engineered heuristic for controlling the amount of theory reasoning presented in our previous work [11] can be matched or even overcome by the automatically discovered neural guidance.

The rest of the paper is structured as follows. Sect. 2 recalls the necessary ATP theory, explains clause selection and how to improve it using ML. Sect. 3 covers layered clause selection and the new lazy evaluation scheme. In Sect. 4, we describe our neural architecture and in Sect. 5 we bring everything together and evaluate the presented ideas, using the prover Vampire as our workhorse and a relevant subset of SMT-LIB as the testing grounds. Finally, Sect. 6 concludes.

# **2 ATPs, Clause Selection, and Machine Learning**

The technology behind the modern automatic theorem provers (ATPs) for first-order logic (FOL), such as E [31], SPASS [40], or Vampire [21], can be roughly outlined using the following three adjectives.

Refutational: The task of the prover is to check whether a given conjecture G logically follows from given axioms A1,...,An, i.e. whether

$$A\_1, \ldots, A\_n \models G,\tag{1}$$

where G and each Ai are FOL formulas. The prover starts by negating the conjecture G and transforming ¬G, A1,...,An into an equisatisfiable set of clauses C. It then applies a sound logical calculus to iteratively derive further clauses, logical consequences of C, until the obvious contradiction in the form of the empty clause ⊥ is derived. This refutes the assumption that ¬G, A1,...,An could be satisfiable and thus confirms (1).

Superposition-based: The most popular calculus used in this context is superposition [3,23], an extension of ordered resolution [4] with a built-in support for handling equality. It consists of several inference rules, such as the resolution rule, factoring, subsumption, superposition, or demodulation.

Inference rules in general determine how to derive new clauses from old ones, where by old clauses we mean either the initial clauses C or clauses derived previously. The clauses that need to be present for a rule to be applicable are called the premises and the newly derived clause is called the conclusion. By applying the inference rules the prover gradually constructs a derivation, a directed acyclic (hyper-)graph (DAG), with the initial clauses forming the leaves and the derived clauses (labeled by the respective applied rules) forming the internal nodes. A proof is the smallest sub-DAG of a derivation containing the final empty clause and for every derived clause the corresponding inference and its premises.

Saturation-based: A saturation algorithm is the concrete way of organizing the process of deriving new clauses such that every applicable inference is eventually considered. Modern saturation-based ATPs employ some variant of the given-clause algorithm, in which clauses are selected for inferences one by one [27].

The process employs two sets of clauses, often called the active set A and the passive set P. At the beginning all the initial clauses are put to the passive set. Then in every iteration, the prover selects and removes a clause C from P, inserts it into A, and performs all the applicable inferences with premises in A such that at least one of the premises is C. The conclusions of these inferences are then inserted into P. This way the prover maintains (at the end of each iteration) the invariant that inferences among the clauses in the active set have been performed. The selected clause C is sometimes also called the "given clause".

During a typical prover run, P grows much faster than A (the growth is roughly quadratic). Analogously, although for different reasons, when a proof is discovered, its clauses constitute only a fraction of A. Notice that every clause C ∈ A that is in the end not part of the proof did not need to be selected and represents a wasted effort. This explains why clause selection, i.e. the procedure for picking in each iteration the next clause to process, is one of the main heuristic decision points in the prover, which hugely affects its performance [32].
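A minimal skeleton of the given-clause loop, with clauses as frozensets of integer literals and a toy binary-resolution rule standing in for the full superposition calculus (real provers add simplification, indexing, and redundancy elimination):

```python
def given_clause(initial, infer, select, max_iters=10_000):
    """Given-clause saturation skeleton: move a clause from passive to
    active, perform all inferences with at least one premise being the
    given clause, and put the conclusions back into passive."""
    active, passive = [], list(initial)
    for _ in range(max_iters):
        if not passive:
            return "saturated"       # every applicable inference was made
        c = select(passive)          # the "given clause"
        if not c:
            return "refutation"      # empty clause: contradiction derived
        active.append(c)
        passive.extend(infer(active, c))
    return "timeout"

def resolve(active, c):
    """Toy binary resolution: literals are ints, -x being the negation of x.
    `active` already contains c; resolving c with itself is harmless here."""
    out = []
    for a in active:
        for lit in c:
            if -lit in a:
                out.append(frozenset((c - {lit}) | (a - {-lit})))
    return out
```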

#### **2.1 Traditional Approaches to Clause Selection**

There are two basic criteria that have been identified as generally correlating with the likelihood of a clause contributing to the yet-to-be discovered proof.

One is a clause's age or, more precisely, its "date of birth", typically implemented as an ever-increasing timestamp. Preferring old clauses over more recently derived ones corresponds to a breadth-first strategy and ensures fairness. The other criterion is a clause's size, referred to as weight in the ATP lingo, realized by some form of symbol counting. Preferring small clauses over large ones is a greedy strategy, based on the observation that small conclusions typically belong to inferences with small premises and that the ultimate conclusion, the empty clause, is the smallest of all. The best results are achieved when these two criteria (or their variations) are combined [32].

To implement efficient clause selection by numerical criteria such as age and weight, an ATP represents the passive set P as a set of priority queues. A queue contains (pointers to) the clauses in P ordered by its respective criterion. Selection typically alternates between the available queues under a certain ratio.

A successful strategy is, for instance, to select 10 clauses by weight for every clause selected by age, i.e., with an age-to-weight ratio of 1:10.
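The two-queue scheme with ratio alternation can be sketched as follows. This is an illustrative toy (the class name `PassiveSet` and its methods are ours), not Vampire's actual data structure:

```python
import heapq
import itertools

class PassiveSet:
    def __init__(self, age_to_weight=(1, 10)):
        self.by_age = []                     # ordered by date of birth
        self.by_weight = []                  # ordered by symbol count
        self.birth = itertools.count()       # ever-increasing timestamp
        self.ratio = age_to_weight
        self.tick = 0
        self.done = set()                    # clauses already selected

    def insert(self, clause, weight):
        age = next(self.birth)
        heapq.heappush(self.by_age, (age, clause))
        heapq.heappush(self.by_weight, (weight, age, clause))

    def select(self):
        a, w = self.ratio
        use_age = self.tick % (a + w) < a    # alternate under the ratio
        self.tick += 1
        queue = self.by_age if use_age else self.by_weight
        while queue:
            *_, clause = heapq.heappop(queue)
            if clause not in self.done:      # skip copies already popped
                self.done.add(clause)        # from the other queue
                return clause
        return None
```

With `age_to_weight=(1, 10)`, one selection in eleven comes from the age queue, matching the 1:10 ratio above.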

#### **2.2 ENIGMA-style Machine-Learned Clause Selection Guidance**

The idea to improve clause selection by learning from previous prover experience goes, to the best of our knowledge, back to Schulz [8, 30] and has more recently been successfully employed by the ENIGMA system and others [7, 15–17, 22].

The experience is collected from successful prover runs, where each selected clause constitutes a training example, marked as positive if the clause ended up in the discovered proof and as negative otherwise. A machine learning (ML) algorithm is then used to fit this data and produce a model M for classifying clauses into positive and negative, accordingly. A good learning algorithm produces a model M which not only accurately classifies the training data but also generalizes well to unseen examples. The computational costs of both training and evaluation are also important.

While clauses are logical formulas, i.e., discrete objects forming a countable set, ML algorithms, rooted in mathematical statistics, are primarily equipped to deal with fixed-sized real-valued vectors. Thus the question of how to represent clauses for the learning is the first obstacle that needs to be overcome before the whole idea can be made to work. In the beginning, the authors of ENIGMA experimented with various forms of hand-crafted numerical clause features [16,17]. An attractive alternative explored in later work [7,15,22] is the use of artificial neural networks, which can be understood as extracting the most relevant features automatically.

An important distinction can in both cases be made between approaches which have access to the concrete identity of the predicate and function symbols (i.e., the signature) that make up the clauses, and those that do not. For example: Is the ML algorithm allowed to assume that the symbol grp_mult is used to represent the multiplication operation in a group, or does it only recognize a general binary function? The first option can be much more powerful, but we need to ensure that the signature symbols are aligned and used consistently across the problems in our benchmark. Otherwise the learned advice cannot meaningfully carry over to previously unsolved problems. While the assumption of an aligned signature has been employed by the early systems [16, 22], the most recent version of ENIGMA [15, 24] can work in a "signature agnostic" mode.

In this work we represent clauses solely by their derivation history, deliberately ignoring their logical content. Thus we do not require the assumption of an aligned signature, per se. However, we rely on a fixed set of distinguished axioms to supply features in the derivation leaves.

#### **2.3 Integrating the Learned Advice**

Once we have a trained model M, an immediate possibility for integrating it into the clause selection procedure is to introduce a new queue that will order the clauses using M. Two basic versions of this idea have been described:

"Priority": The ordering puts all the clauses classified by M as positive before those classified negatively. Within the two classes, older clauses are preferred.

Let us for future reference denote this scheme M<sup>1,0</sup>. It has been successfully used by the early ENIGMAs [7, 16, 17].

"Logits": Even models officially described as binary classifiers typically internally compute a real-valued estimate L of how much "positive" or "negative" an example appears to be, and only turn this estimate into a binary decision in the last step, by comparing it against a fixed threshold t, most often 0. A machine learning term for this estimate L is the logit.<sup>1</sup>

The second version orders the clauses on the new queue by the "raw" logits produced by the model. We denote it M<sup>−R</sup> to stress that clauses with a high L are treated as small from the perspective of the selection and are therefore preferred. This scheme has been used by Loos et al. [22] and in the latest ENIGMA [15,37].

Combining with a traditional strategy. While it is possible to rely exclusively on selection governed by the model, it turns out to be better [7] to combine it with the traditional heuristics. The most natural choice is to take S, the original strategy that was used to generate the training data, and extend it by adding the new queue, be it M<sup>1,0</sup> or M<sup>−R</sup>, next to the already present queues. We then again supply a ratio under which the original selection from S and the new selection based on M get alternated. We will denote this kind of combination with the original strategy as S⊕M<sup>1,0</sup> and S⊕M<sup>−R</sup>, respectively.

## **3 Layered Clause Selection and Lazy Model Evaluation**

Layered clause selection (LCS) is a recently developed method [10, 11, 36] for smoothly incorporating a categorical preference for certain clauses into a base clause selection strategy S. In this paper, we will readily use it in combination with the binary classifier advice from a trained model M.

When we instantiate LCS to our particular case,<sup>2</sup> its function can be summarized by the expression

$$
\mathcal{S} \oplus \mathcal{S}[\mathcal{M}^1].
$$

In words, the base selection strategy S is alternated with S[M<sup>1</sup>], the same selection scheme S but applied only to clauses classified positively by M. Implicit here is a convention that whenever there is no positively classified passive clause, a fallback to plain S occurs. Additionally, we again specify a "second-level" ratio to govern the alternation between pure S and S[M<sup>1</sup>].
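The layered scheme, including the fallback, can be sketched in a few lines; `base_key` (standing in for S's ordering) and `positive` (the model's verdict) are placeholders of ours:

```python
def layered_select(passive, base_key, positive, use_restricted):
    """One selection step of S (use_restricted=False) or S[M¹] (True)."""
    pool = [c for c in passive if positive(c)] if use_restricted else passive
    if not pool:                        # no positively classified clause:
        pool = passive                  # fall back to plain S
    chosen = min(pool, key=base_key)    # S's own criterion does the ordering
    passive.remove(chosen)
    return chosen
```

The point of the scheme is visible in the code: the restricted branch reuses exactly the same `base_key` ordering, only on a filtered pool.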

The main advantage of LCS, compared to the options outlined in the previous section, is that the original, typically well-tuned, base selection mechanism S is also applied to M<sup>1</sup>, the clauses classified positively by M.

<sup>1</sup> A logit can be turned into a (formal) probability, i.e. a value between 0 and 1, by passing it, as is typically done, through the sigmoid function σ(x) = 1/(1 + e<sup>−x</sup>).

<sup>2</sup> We rely here on the monotone mode of split; there is also a disjoint mode [10].

#### **3.1 Lazy Model Evaluation**

It is often the case that evaluating a clause by the model M is a relatively expensive operation [22]. As we explain here, however, this operation can be avoided in many cases, especially when using LCS to integrate the advice.

We propose the following lazy evaluation approach to be used with S⊕S[M<sup>1</sup>]. Every clause entering the passive set P is initially inserted into both S and S[M<sup>1</sup>] without being evaluated by M. Then, whenever (as governed by the second-level ratio) it is the moment to select a clause from S[M<sup>1</sup>], the algorithm pops the best clause from S[M<sup>1</sup>] according to the criteria of S, evaluates it by M unless its evaluation is already known, and discards it if it is classified negatively.
This repeats until the first positively classified clause is found, which is then returned. Note that this way the "observable behaviour" of S[M<sup>1</sup>] is preserved.
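A sketch of this lazy scheme on the S[M¹] queue follows; the function name and the `verdicts` cache are illustrative, the point being that the (potentially expensive) model is consulted only for clauses that actually reach the front:

```python
import heapq

def select_from_restricted(queue, model, verdicts):
    """queue: heap of (S_key, clause) pairs; verdicts: cache of model outputs."""
    while queue:
        _, clause = heapq.heappop(queue)
        if clause not in verdicts:
            verdicts[clause] = model(clause)   # evaluate only on demand
        if verdicts[clause]:                   # first positive clause wins
            return clause
        # negatively classified clauses are simply dropped from S[M¹]
    return None                                # caller falls back to plain S
```

Clauses deep in the queue are never evaluated at all, which is exactly where the savings described above come from.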

The power of lazy evaluation lies in the fact that not every clause needs to be evaluated before a proof is found. Indeed, recall the remark that the passive set P is typically much larger than the active set A, which also holds on a typical successful termination. Every clause left in passive at that moment is a clause that did not need to be evaluated by M thanks to lazy evaluation.

We remark that lazy evaluation can similarly be used with the integration mode M<sup>1,0</sup> based on priorities.

We experimentally demonstrate the effect of the technique in Sect. 5.4.

## **4 A Neural Classification of Clause Derivations**

In this work we choose to represent a clause, for the purpose of learning, solely by its derivation history. Thus a clause can only be distinguished by the axioms from which it was derived and by the precise way in which these axioms interacted with each other through inferences in the derivation. This means we deliberately ignore the clause's logical content.

We decided to focus on this representation because it promises to be fast. Although an individual clause's derivation history may be large, it is a simple function of its parents' histories (just one application of an inference rule). Moreover, before a clause with a complicated history can be selected, most of its ancestors will have been selected already.<sup>3</sup> This keeps the amortised cost of evaluating a single clause constant.

A second motivation comes from our recent work [11], where we have shown that theory reasoning facilitated by automatically added theory axioms, while in itself a powerful technique, often leads the prover to unpromising parts of the search space. We developed a heuristic for controlling the amount of theory reasoning in the derivation of a clause [11]. Our goal here is to test whether a similar or even stronger heuristic can be automatically discovered by a neural network.

<sup>3</sup> Exceptions are caused by simplifying inferences applied eagerly outside of the governance of the main clause selection mechanism.

Examples of axioms that Vampire uses to axiomatise theories include the commutativity or associativity axioms for the arithmetic operations, an axiomatization of the theory of arrays [6] or of the theory of term algebras [20]. For us it is mainly important that the axioms are introduced internally by the prover and can therefore be consistently identified across individual problems.

#### **4.1 Recursive Neural Networks**

A recursive neural network (RvNN) is a network created by recursively composing a finite set of neural building blocks over a structured input [12]. A general neural block is a function N<sub>θ</sub> : R<sup>k</sup> → R<sup>l</sup> depending on a vector of parameters θ that can be optimized during training (see Sect. 4.3 below).

In our case, the structured input is a clause derivation, i.e. a DAG with nodes identified with the derived clauses. To enable the recursion, an RvNN represents each node C by a real vector v<sub>C</sub> (of a fixed dimension n) called a (learnable) embedding. During training, the network learns to embed the space of derivable clauses into R<sup>n</sup> in some a priori unknown, but still useful, way.

We assume that each initial clause C, a leaf of the derivation DAG, is labeled as belonging to one of the automatically added theory axioms or coming from the user input. Let these labels form a finite set of axiom origin labels L<sub>A</sub>. Furthermore, let the applicable inference rules that label the internal nodes of the DAG form a finite set of inference rule labels L<sub>R</sub>. The specific building blocks of our neural architecture are the following three (indexed families of) functions:

- an init function I<sub>l</sub> for every axiom origin label l ∈ L<sub>A</sub>, providing the embeddings of the leaves,
- a deriv function D<sub>r</sub> for every inference rule r ∈ L<sub>R</sub>, combining the embeddings of the premises of an application of r into an embedding of the conclusion, and
- an eval function E : R<sup>n</sup> → R, turning the embedding of a clause into its classification logit.
By recursively composing the init and deriv functions, any derived clause C can be assigned an embedding v<sub>C</sub> and also evaluated by E, to see whether the network recommends it as positive, i.e. as a clause that should be preferred in the proof search.
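The recursion above can be made concrete with a toy instance: derivation nodes are either `('leaf', axiom_origin_label)` or `('rule', rule_label, premises)`, and embeddings are tiny tuples. The concrete `init` vectors and `deriv` functions below are illustrative stand-ins for the learned blocks, not the trained ones:

```python
def embed(node, init, deriv, memo=None):
    memo = {} if memo is None else memo
    key = id(node)
    if key in memo:                 # shared sub-derivations: computed once
        return memo[key]
    if node[0] == 'leaf':
        v = init[node[1]]           # I_l for the axiom origin label l
    else:
        _, rule, premises = node
        vs = [embed(p, init, deriv, memo) for p in premises]
        v = deriv[rule](vs)         # D_r applied to the premise embeddings
    memo[key] = v
    return v

# toy building blocks: 2-dimensional embeddings, component-wise sum as D_r
init = {'input': (1.0, 0.0), 'thax_commutativity': (0.0, 1.0)}
deriv = {'Resolution': lambda vs: tuple(map(sum, zip(*vs)))}
```

Applying an eval function E to the resulting vector would then produce the clause's logit.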

#### **4.2 Architecture Details**

Here we outline the details of our architecture for the benefit of neural network practitioners. All the used terminology is standard (see, e.g., [13]).

We realized each init function I<sub>l</sub> as an independent learnable vector. Similarly, each deriv function D<sub>r</sub> was independently defined. For a rule of arity two, such as resolution, we used:

$$D_r(v_1, v_2) = \text{LayerNorm}(y), \quad y = W_2^r \cdot x + b_2^r, \quad x = \text{ReLU}(W_1^r \cdot [v_1, v_2] + b_1^r),$$

where [·, ·] denotes vector concatenation, ReLU is the rectified linear unit non-linearity (f(x) = max{0, x}) applied component-wise, and the learnable matrices W<sub>1</sub><sup>r</sup>, W<sub>2</sub><sup>r</sup> and vectors b<sub>1</sub><sup>r</sup>, b<sub>2</sub><sup>r</sup> are such that x ∈ R<sup>2n</sup> and y ∈ R<sup>n</sup>. (We took inspiration from Sandler et al. [29] for doubling the embedding size before applying the non-linearity.) Finally, LayerNorm is a layer normalization [2] module, without which training often became numerically unstable for deeper derivation DAGs.<sup>4</sup>
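For illustration, the D_r block can be rendered in pure Python (the actual implementation uses PyTorch); the weights in the usage below are fixed toy values, not trained ones:

```python
import math

def relu(v):      return [max(0.0, x) for x in v]
def add(u, v):    return [a + b for a, b in zip(u, v)]
def matvec(W, v): return [sum(w * x for w, x in zip(row, v)) for row in W]

def layer_norm(y, eps=1e-5):
    mean = sum(y) / len(y)
    var = sum((x - mean) ** 2 for x in y) / len(y)
    return [(x - mean) / math.sqrt(var + eps) for x in y]

def deriv_binary(v1, v2, W1, b1, W2, b2):
    x = relu(add(matvec(W1, v1 + v2), b1))  # [v1, v2]: concatenation, R^{2n}
    y = add(matvec(W2, x), b2)              # project back down to R^n
    return layer_norm(y)                    # stabilizes deep derivation DAGs
```

Note how the concatenated input already has dimension 2n, matching the "doubling" remark above.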

For unary inference rules, such as factoring, we used an equation analogous to the above, except for the concatenation operation. We did not need to model an inference rule with a variable number of premises, but one option would be to arbitrarily "bracket" its arguments into a tree of binary applications.

Finally, the eval function was E(v) = W<sub>2</sub> · ReLU(W<sub>1</sub> · v + b) + c with trainable W<sub>1</sub> ∈ R<sup>n×n</sup>, b ∈ R<sup>n</sup>, W<sub>2</sub> ∈ R<sup>1×n</sup>, and c ∈ R.

#### **4.3 Training the Network**

To train a network means to find values for the trainable parameters such that it accurately classifies the training data and ideally also generalises to unseen future cases. We follow a standard methodology for training our RvNN.

In particular, we use the gradient descent (GD) optimization algorithm (with the Adam optimiser [19]) minimising the typical binary cross-entropy loss, composed as a sum of contributions, for every selected clause C, of the form

$$-y_C \cdot \log(\sigma(E(v_C))) - (1 - y_C) \cdot \log(1 - \sigma(E(v_C))),$$

with y<sub>C</sub> = 1 for the positive and y<sub>C</sub> = 0 for the negative examples.

These contributions are weighted such that each derivation DAG (corresponding to a prover run on a single problem) receives equal weight. Moreover, within each DAG we re-scale the influence of the positive versus the negative examples such that the two categories contribute evenly. The scaling is important as our training data is highly unbalanced (cf. Sect. 5.1).
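The described re-weighting can be sketched as follows; the helper names are ours, and the per-clause term is exactly the cross-entropy contribution displayed above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(logit, y):          # the per-clause contribution from the text
    p = sigmoid(logit)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def weighted_loss(derivations):
    """derivations: list of DAGs, each a list of (logit, label) pairs."""
    total = 0.0
    for dag in derivations:
        pos = [bce(l, y) for l, y in dag if y == 1]
        neg = [bce(l, y) for l, y in dag if y == 0]
        per_dag = 0.0
        if pos:
            per_dag += 0.5 * sum(pos) / len(pos)   # positives as one group
        if neg:
            per_dag += 0.5 * sum(neg) / len(neg)   # negatives as the other
        total += per_dag / len(derivations)        # equal weight per DAG
    return total
```

A single positive example thus carries as much weight as all the negatives of its DAG together, which counters the imbalance.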

We split the available successful derivations into a training set and a validation set, train only on the first set, and use the second to observe generalisation to unseen examples. As the GD algorithm progresses, iterating over the training data in rounds called epochs, we evaluate the loss on the validation set and stop the process early if this loss does not decrease for a specified period. This early stopping criterion was important for producing a model that generalizes well.

As another form of regularisation, i.e. a technique for preventing overfitting to the training data, we employ dropout [35] (independently for each "read" of a clause embedding by one of the deriv or eval functions). Dropout means that at training time each component v<sub>i</sub> of the embedding v has a certain probability of being zeroed out. This "voluntary brain damage" makes the network more robust, as it prevents neurons from forming overly complex co-adaptations [35].

Finally, we experimented with using non-constant learning rates as suggested by Smith et al. [33,34]. In the end, we used a schedule with a linear warmup for the first 50 epochs followed by a hyperbolic cooldown [38] (cf. Fig. 1 in Sect. 5.2).

<sup>4</sup> We also tried skipping LayerNorm and replacing ReLU by the hyperbolic tangent function. This restores stability, but does not train or classify as well.

#### **4.4 An Abstraction for Compression and Caching**

Since our representation of clauses deliberately discards information, we end up encountering distinct clauses that are indistinguishable from the perspective of the network. For example, every initial clause C originating from the input problem (as opposed to being added as a theory axiom) receives the same embedding v<sub>C</sub> = I<sub>input</sub>. Indistinguishable clauses also arise as conclusions of an inference that can be applied in more than one way to certain premises.

Mathematically, we deal with an equivalence relation ∼ on clauses based on "having the same derivation tree": C<sub>1</sub> ∼ C<sub>2</sub> ↔ derivation(C<sub>1</sub>) = derivation(C<sub>2</sub>). The "fingerprint" derivation(C) of a clause could be defined as a formal expression recording the derivation history of C using the labels from L<sub>A</sub> as nullary operators and those from L<sub>R</sub> as operators with the arities of the corresponding inference rules. For example: Resolution(thax_inverse_assoc, Factoring(input)).
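Such a fingerprint is easy to compute and compare; a sketch with toy node tuples and made-up axiom labels (the same node shape as in Sect. 4.1) might look like:

```python
def fingerprint(node):
    """node = ('leaf', origin_label) or ('rule', rule_label, premises)."""
    if node[0] == 'leaf':
        return node[1]                    # nullary operator from L_A
    _, rule, premises = node
    return (rule,) + tuple(fingerprint(p) for p in premises)
```

Being a nested tuple, the fingerprint is hashable, so it can directly serve as a dictionary key for both compression and caching.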

We made use of this equivalence in our implementation in two places:

- when preparing the training data, where ∼-equivalent clauses get merged into a single node, substantially compressing the derivation DAGs (cf. Sect. 5.1), and
- in the prover, where the model's verdicts are cached and looked up by the clause's abstract derivation, so that each equivalence class is evaluated by M at most once (the abstraction caching of Sect. 5.4).

## **5 Experiments**

We implemented the infrastructure for training an RvNN clause derivation classifier (as described in Sect. 4) in Python, relying on the PyTorch (version 1.7) library [25] and its TorchScript extension for interfacing the trained model from C++. We modified the automatic theorem prover Vampire (version 4.5.1) (1) to optionally record to a log file the constructed derivation, including information on the selected clauses and the clauses found in the discovered proof (the logging-mode), and (2) to be able to load a trained TorchScript model and use it for clause selection guidance under various modes of integration (detailed in Sects. 2.3 and 3).<sup>5</sup>

We took the same subset of 20 795 problems from the SMT-LIB library [5] as in previous work [11]: formed as the largest set of problems in a fragment supported by Vampire, excluding problems known to be satisfiable and those provable by Vampire's default strategy in 10 s either without adding theory axioms or while performing clause selection by age only.

As the baseline strategy S we took Vampire's implementation of the DISCOUNT saturation loop under the age-to-weight ratio 1:10 (which typically performs well with DISCOUNT), keeping all other settings at their defaults, including the enabled AVATAR architecture. We later enhanced this S with various forms of guidance. All benchmarking was done using a 10 s time limit.<sup>6</sup>

<sup>5</sup> Supplementary materials can be found at https://git.io/JtHNl.

<sup>6</sup> Running on a server with Intel(R) Xeon(R) Gold 6140 CPUs @ 2.3 GHz and 500 GB RAM, using no more than 30 of the available 72 cores to reduce mutual influence.

#### **5.1 Data Preparation**

During an initial run, the baseline strategy S was able to solve 734 problems under the 10 s time limit. We collected the corresponding successful derivations using the logging-mode (lifting the time limit, since the logging causes a non-negligible overhead) and processed them into a form suitable for training a neural model. The derivations contained approximately 5.0 million clauses in total (the overall context), out of which 3.9 million were selected<sup>7</sup> (the training examples) and 30 thousand of these appeared in a proof (the positive examples). In these derivations, Vampire used 31 distinct theory axioms to facilitate theory reasoning. Including the "user input" label for clauses coming from the actual problem files, there were in total 32 distinct labels for the derivation leaves. In addition, we recorded 15 inference rules, such as resolution, superposition, backward and forward demodulation, or subsumption resolution, including one rule for the derivation of a component clause in AVATAR [26, 39]. Thus we obtained 15 distinct labels for the internal nodes.

We compressed these derivations by identifying clauses with the same "abstract derivation history" dictated by the labels, as described in Sect. 4.4. This reduced the derivation set to 0.7 million nodes (i.e. abstracted clauses) in total. Out of the 734 derivations, 242 were still larger than 1000 nodes (the largest had 6426 nodes) and each of these gave rise to a separate "mini-batch". We grouped the remaining 492 derivations to obtain an approximate size of 1000 nodes per mini-batch (the maximum was 12 original derivations grouped in one mini-batch). In total, we obtained 412 mini-batches and randomly singled out 330 (i.e., 80 %) of these for training, keeping 82 aside for validation.

#### **5.2 Training**

Since the size of the training set is relatively small, we instantiated the architecture described in Sect. 4.2 with embedding size n = 64 and dropout probability p = 0.3. We trained for 100 epochs, with a non-constant learning rate peaking at α = 2.5 × 10<sup>−4</sup> in epoch 50. After every epoch we computed the loss on the validation set and selected the model which minimizes this quantity. In our case, this was the model from epoch 45, which we will denote M here.

The development of the training and validation loss throughout training, as well as that of the learning rate, is plotted in Fig. 1. Additionally, the right side of the figure allows us to compare the validation loss—an ML estimate of the model's ability to generalize—with the ultimate metric of practical generalization, namely the number of in-training-unseen problems solved by Vampire equipped with the corresponding model for guidance.<sup>8</sup> We can see that the "proxy" (i.e. the minimisation of the validation loss) and the "target" (i.e. the maximisation of ATP performance) correspond quite well, at least to the degree that we measured the highest ATP gain with the validation-loss-minimizing M.

<sup>7</sup> Ancestors of selected clauses are sometimes not selected clauses themselves if they arise through immediate simplifications or through reductions.

<sup>8</sup> Integrated using the layered scheme with a second level ratio 2:1 (cf. Sect. 5.3).

**Fig. 1.** Training the neural model. Red: the training (left) and validation (right) loss as a function of training time; shaded: per-problem weighted standard deviations. Blue (left): the supplied non-constant learning rate (cf. Sect. 4.3). Green (right): in-training-unseen problems solved by Vampire equipped with the corresponding model.

We remark that this assurance was not cheap to obtain. While the whole 100-epoch training took 45 minutes to complete (using 20 workers and 1 master process in a parallel training setup), each of the 20 ATP evaluation data points corresponds to approximately 2 hours of 30-core computation.

#### **5.3 Advice Integration**

In this part of the experiment we tested the various ways of integrating the learnt advice described in Sects. 2.3 and 3. Let us recall that these are the single-queue schemes M<sup>−R</sup> and M<sup>1,0</sup>, based on the raw logits and the binary decision, respectively, their combinations S⊕M<sup>−R</sup> and S⊕M<sup>1,0</sup> with the base strategy S under some second-level ratio, and, finally, S⊕S[M<sup>1</sup>], the integration of the guidance by the layered clause selection scheme.

Our results are shown in Table 1. It starts by reporting the performance of the baseline strategy S and then compares it to the other strategies (the gained and lost columns are w.r.t. the original run of S).<sup>9</sup> We can see that the two single-queue approaches are quite weak, with the better of them, M<sup>1,0</sup>, solving only 25 % of the baseline. Nor can the combination S⊕M<sup>−R</sup> be considered a success, as it only solves more problems when less and less advice is taken, seemingly approaching the performance of S from below. This trend repeats with S⊕M<sup>1,0</sup>, although here an interesting number of problems not solved by the baseline is gained by strategies which rely on the advice more than half of the time.

With our model M, only the layered clause selection integration S⊕S[M<sup>1</sup>] is able to improve on the performance of the baseline strategy S. In fact, it

<sup>9</sup> We had to switch to a different machine after producing the training data. There, a rerun of S gave a slightly better performance than the 734 solved problems used for training. We still used the original run's results to compute the gained and lost values here; the percentage solved is with respect to the new run of S.


**Table 1.** Performance results of various forms of integrating the model advice.

**Table 2.** Performance decrease caused by turning off abstraction caching, lazy evaluation, or both; demonstrated on S⊕S[M<sup>1</sup>] under the second-level ratio 1:2.


improves on it very significantly: with the second-level ratio of 1:2 we achieve 137 % of the baseline performance and gain 430 problems unsolved by S.

#### **5.4 Evaluation Speed, Lazy Evaluation, and Abstraction Caching**

Table 1 also shows the percentage of computation time the individual strategies spent evaluating the advice, i.e. interfacing M.

A word of warning first. These numbers are hard to interpret across different strategies, because different guidance steers the prover to different parts of the search space. For example, notice the seemingly paradoxical situation, most pronounced with S⊕M<sup>−R</sup>, where the more often the advice from M is nominally requested, the less time the prover spends interfacing M. Looking closely at a few problems, we discovered that in strategies relying heavily on M<sup>−R</sup>, such as S⊕M<sup>−R</sup> under the ratio 1:5, most of the time is spent performing forward subsumption. An explanation is that the guidance becomes increasingly bad and the prover slows down, processing larger and larger clauses for which the subsumption checks are expensive and dominate the runtime.<sup>10</sup>

<sup>10</sup> A similar experience with bad guidance has been made by the authors of ENIGMA.

**Fig. 2.** The receiver operating characteristic curve (left) and a related plot with explicit threshold (right) for the selected model M; both based on validation data.

When the guidance is the same, however, we can use the evaluation time percentage to estimate the efficiency of the integration. The results shown in Table 1 were obtained using both lazy evaluation<sup>11</sup> and abstraction caching (as described in Sects. 3.1 and 4.4). Taking the best performing S⊕S[M<sup>1</sup>] under the second-level ratio 1:2, we selectively disabled first abstraction caching, then lazy evaluation, and finally both techniques, obtaining the values shown in Table 2.

We can see that the two techniques contribute considerably to the overall performance. Indeed, without them Vampire would spend a full 73 % of its computation time evaluating the network (compared to only 33 %) and the strategy would barely match (at 103 %) the performance of the baseline S.

#### **5.5 Positive Bias**

Two important characteristics, from a machine learning perspective, of an obtained model are the true positive rate (TPR) (also called sensitivity) and the true negative rate (TNR) (also called specificity). TPR is defined as the fraction of positively labeled examples which the model also classifies as such. TNR is, analogously, the fraction of negatively labeled examples which the model classifies as negative. Our model M achieves (on the validation set) 86 % TPR and 81 % TNR.

The final judgement of a neural classifier follows from a comparison with a threshold value t, set by default to t = 0 (recall Sect. 2.3). Changing this threshold allows us to trade TPR for TNR and vice versa in a straightforward way. The interdependence of these two values on the varied threshold is traditionally captured by the so-called receiver operating characteristic (ROC) curve, shown for our model in Fig. 2 (left). The tradition dictates that the x axis be labeled by the false positive rate (FPR) (also called fall-out), which is simply 1 − TNR. Under such a presentation, one generally strives to pick a threshold value at which the

<sup>11</sup> With the exception of the <sup>M</sup><sup>−</sup><sup>R</sup> guidance, with which it is incompatible.

**Table 3.** The performance of S⊕S[M<sup>1</sup>] under the second level ratio 1:2 while changing the logit threshold. A smaller threshold means more clauses classified as positive.


curve is the closest to the upper left corner of the plot.<sup>12</sup> However, this is not necessarily the best configuration for every application.

In Fig. 2 (right), we "decompose" the ROC curve by using the threshold t for the independent axis x. We also highlight, for every problem (again, in the validation set), the minimal logit value across all positively labeled examples belonging to that problem, in other words, the logit of the "least positively classified" clause from the problem's proof. We can see that for the majority of the problems these minima are below the threshold t = 0. This means that for those problems at least one clause from the original proof is classified as negative by M under t = 0.
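The TPR/FPR trade-off driven by t can be illustrated directly; the logits below are made-up toy values, not the model's actual outputs:

```python
def rates(examples, t=0.0):
    """examples: list of (logit, label) pairs; returns (TPR, FPR)."""
    tp = sum(1 for l, y in examples if y == 1 and l > t)    # kept positives
    fp = sum(1 for l, y in examples if y == 0 and l > t)    # false alarms
    npos = sum(1 for _, y in examples if y == 1)
    nneg = len(examples) - npos
    return tp / npos, fp / nneg
```

Sweeping t over all logit values and plotting the resulting pairs is exactly what produces the ROC curve of Fig. 2 (left).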

These observations motivated us to experiment with non-zero values of the threshold in an ATP evaluation. Particularly promising seemed the use of a threshold t smaller than zero, with the intention of classifying more clauses as positive. The results of the experiment are shown in Table 3. Indeed, we could further improve the best performing strategy from Table 1 with both t = −0.25 and t = −0.5. It can be seen that smaller values lead to fewer problems lost, but even the ATP gain is better with t = −0.25 than with the default t = 0, leading to the overall best improvement of 141 % with respect to the baseline S.

#### **5.6 Learning from Guided Proofs and Negative Mining**

As previously unsolved problems get proven with the help of the trained guidance, the new proofs can be used to enrich the training set and potentially help obtain even better models. This idea of alternating the training and ATP evaluation steps in a reinforcing loop has been proposed and successfully realized by the authors of ENIGMA on the Mizar dataset [18]. Here we propose an enhancement of the idea and repeat an analogous experiment in our setting.

By collecting proofs discovered by a selection of 8 different configurations tested in the previous sections, we grew our set of solved problems from 734 to 1528. We decided to keep one proof per problem, strictly extending the original training set. We then repeated the same training procedure as described in Sect. 5.2 on this new set and on an extension of this set obtained as follows.

Negative mining: We suspected that the successful derivations obtained with the help of M might not contain enough "typical wrong decisions" from the

<sup>12</sup> Minimizing the standard cross entropy loss should actually automatically "bring the curve" close to that corner for the threshold t = 0.

**Table 4.** The performance of new models learned from guided proofs. U is the set of 1528 problems used for the training. The gained and lost counts are here w.r.t. U.


perspective of S to provide good enough training. We therefore logged the failing runs of S on the (1528 − 734) problems solved only by one of the guided strategies and augmented the corresponding derivations with these.<sup>13</sup>

Table 4 confirms<sup>14</sup> that negative mining indeed helps to produce a better model. Mainly, however, it shows that training from additional derivations further dramatically improves the performance of the obtained strategy.

## **6 Conclusion**

We revisited the topic of ENIGMA-style clause selection guidance by a machine-learned binary classifier and proposed four improvements over previous work: (1) the use of layered clause selection for integrating the advice, (2) the lazy evaluation trick to reduce the overhead of interfacing a potentially expensive model, (3) the "positive bias" idea, suggesting that one should be careful not to discard potentially useful clauses, and (4) the "negative mining" technique to provide enough negative examples when learning from proofs obtained with previous guidance.

We have also shown that strong advice can be obtained by looking just at a clause's derivation history. The automatically discovered neural guidance significantly improves upon the human-engineered heuristic [11] under identical conditions. Rerunning S with the theory heuristic enabled in its default form [10] resulted here in 816 (107 %) solved problems.

By deliberately focusing on the representation of clauses by their derivations, we obtained some nice properties, such as relative speed of evaluation. However, in situations where theory reasoning via automatically added theory axioms is not prevalent, such as on most of the TPTP library, we expect guidance based on derivations with just a single axiom origin label, the input, to be quite weak.

Still, we see a great opportunity in using statistical methods for analyzing ATP behaviour: not only for improving prover performance with black-box guidance, but also as a tool for discovering regularities that could be exploited to improve our understanding of the technology at a deeper level.

#### **Acknowledgement**

This work was supported by the Czech Science Foundation project 20-06390Y and the project RICAIP no. 857306 under the EU-H2020 programme. We also thank the anonymous reviewers for suggesting numerous improvements.

<sup>13</sup> Negative mining has, for instance, been previously used when training deep models for the premise selection task [14].

<sup>14</sup> The ATP evaluation was again integrated via S⊕S[M<sup>1</sup>] under the second-level ratio 1:2.

# **References**


Spain. pp. 2235–2243 (2016), https://proceedings.neurips.cc/paper/2016/hash/ f197002b9a0853eca5e046d9ca4663d5-Abstract.html


cial Intelligence and Applications, vol. 325, pp. 1395–1402. IOS Press (2020). https://doi.org/10.3233/FAIA200244


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **System Descriptions**

# **A Normative Supervisor for Reinforcement Learning Agents**

Emery Neufeld<sup>1</sup>, Ezio Bartocci<sup>1</sup>, Agata Ciabattoni<sup>1</sup>, and Guido Governatori<sup>2</sup>

> <sup>1</sup> TU Wien, Vienna, Austria <sup>2</sup> Data61, CSIRO, Melbourne, Australia

**Abstract.** We introduce a modular and transparent approach for augmenting the ability of reinforcement learning agents to comply with a given norm base. The normative supervisor module functions as both an event recorder and real-time compliance checker w.r.t. an external norm base. We have implemented this module with a theorem prover for defeasible deontic logic, in a reinforcement learning agent that we task with playing a "vegan" version of the arcade game Pac-Man.

## **1 Introduction**

Autonomous agents are an increasingly integral part of modern life. While performing activities formerly reserved for human agents, they must possess the ability to adapt to (potentially unpredictable) changes in their environment; reinforcement learning (RL) has proven a successful method for teaching agents this behaviour (see, e.g. [16,13]). Performing human roles further requires that agents align themselves with the ethical standards their human counterparts are subject to, introducing a requirement for ethical reasoning. RL has been employed to enforce such standards as well (see, e.g., [14]); agents can be trained to act in line with further rewards/penalties assigned according to the performance of ethical/unethical behaviour through a reward function. However, this does not provide a guarantee of the desired behaviour. Moreover, such techniques are not well equipped to handle the complexities of ethical reasoning. In general, like other black-box machine learning methods, RL cannot transparently explain why a certain policy is compliant or not. Additionally, when the ethical values are embedded in the learning process, a small change in their definition would require us to retrain the policy from scratch.

To obviate the limitations of RL to represent ethical norms, the approach we follow in this paper combines RL with Deontic Logic, the branch of formal logic that is concerned with prescriptive statements; we implement a normative supervisor to inform a trained RL agent of the ethical requirements in force in a given situation. Since the pioneering works [17,15], it has been well understood that Deontic Logic can be applied to model ethical norms; the difference between ethical and legal norms is indeed only on how they emerge, not what normative consequences are entailed by them. We implement our normative supervisor using

<sup>1</sup> This work was partially supported by WWTF project MA16-28 and the DC-RES run by the TU Wien's Faculty of Informatics and the FH-Technikum Wien.

<sup>©</sup> The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 565–576, 2021. https://doi.org/10.1007/978-3-030-79876-5_32

defeasible deontic logic [8,9]. This is a simple and computationally feasible, yet expressive, logic allowing for defeasible reasoning, and can easily accommodate changes to the norm base, should the ethical requirements become more complex (see Sect. 3.4 for a brief walk-through). Moreover, the constructive nature of this logic allows us to determine how a given conclusion has been reached.

By embedding the normative supervisor into the RL agent architecture, the agent can follow near-optimal learned policies while enforcing ethical actions in a modular and transparent way. The supervisor functions as both an event recorder and real-time compliance checker; it corrects the choice of a given action from the policy only when this violates a norm. It is furthermore used as an event logger to identify and extract new sets of (ethical) norms to promote particular goals. We demonstrate our approach on an RL agent that plays a "vegan" version of Pac-Man, with an "ethical" constraint forbidding Pac-Man from eating ghosts. Already used as a case study in [14,10], the Pac-Man game is a closed environment for testing with clearly defined game mechanics and parameters which are easy to isolate, manipulate, and extend with variably intricate rule sets. We successfully evaluated our approach with several tests, consisting of "vegan" games and a "vegetarian" version of the game where the agent can eat only one type of ghost. The achievement of full compliance in the latter case was possible with the introduction of additional norms identified via the event recorder.

**Related Work.** The papers [14] and [10] on Pac-Man motivated our work. The former employs multi-objective RL with policy orchestration to impose normative constraints on vegan Pac-Man. It seamlessly combines ethically compliant behaviour and learned optimal behaviour; however, the ethical reasoning performed is still to a degree implicit, it does not provide justifications for the choices made, and it is not clear how the approach would remain reasonably transparent with more complex norm sets. [10] takes steps to integrate more complex constraints on a RL agent, but as they are embedded in the learned policy, it lacks the transparency of a logic-based implementation. [1] and [2] address the problem of transparency in the implementation of ethical behaviours in AI, but their approach has not been implemented and tested yet. Symbolic reasoning for implementing ethically compliant behaviour in autonomous agents has been used in many frameworks, such as [5], which models the behaviour from a BDI perspective. This approach does not allow for defeasible reasoning, and focuses on avoiding ethical non-compliance at the planning level. Non-monotonic logic-based approaches that extend BDI with a normative component appear in [6,9], whose solutions remain only at the theoretical level. These papers belong to the related field of Normative Multi-Agent Systems, which is not specifically concerned with the ethical behaviour of agents [3], and whose introduced formalisms and tools (e.g. [12]) have not yet been used in combination with RL.

# **2 Background**

**Normative Reasoning.** Normative reasoning differs from the reasoning captured by classical logic in that the focus is not on true or false statements, but rather the imposition of norms onto such statements.

We will deal with two types of norms: constitutive and regulative norms (see [4] for the terminology). Regulative norms describe obligations, prohibitions and permissions. Constitutive norms regulate instead the creation of institutional facts as well as the modification of the normative system itself; their content is a relation between two concepts, and they will typically take the form "in context c, concept x counts as concept y", where x refers to a more concrete concept (e.g., walking) and y to a more abstract one (e.g., moving). We say concept x is at a lower level of abstraction than concept y in context c if there is a constitutive norm with context c asserting that x counts as y (henceforth denoted **C**(x, y)).

**Reinforcement Learning (RL).** RL refers to a class of algorithms specialized in learning how an agent should act in its environment to maximize the expected cumulative reward. Given a function that assigns rewards/penalties to each state and successor state pair (or state-action pairs), the RL algorithm learns an optimal policy, a function from states to actions that can govern its behaviour.

In our case study we chose Q-learning [18] with function approximation as a RL algorithm. In Q-learning, the RL algorithm first learns a function Q(s, a) to predict the expected cumulative reward (Q-value) from state s taking action a. The learned policy picks the action argmaxa∈possible Q(s, a) with the highest Q-value over a list of possible actions. The function Q is approximated as a linear function which is the weighted sum of features describing some elements of the environment (e.g., the distance between the agent and object X); the features which are most relevant to predicting the agent success are weighted most heavily.
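As an illustration (our sketch, not the case study's actual implementation), the linear Q-function approximation and greedy policy described above can be written as follows; all feature names and weights are assumptions for the example.

```python
# Sketch of linear Q-function approximation: Q(s, a) is a weighted sum of
# features, and the policy picks the action with the highest Q-value.

def q_value(weights, features):
    """Q(s, a) approximated as a weighted sum of state-action features."""
    return sum(weights[name] * value for name, value in features.items())

def greedy_action(weights, feature_fn, state, possible):
    """argmax over the list of possible actions."""
    return max(possible, key=lambda a: q_value(weights, feature_fn(state, a)))

# Hypothetical features: distance to the nearest food pellet (penalized)
# and to a scared ghost (rewarded, as under the "hungry" policy).
weights = {"dist_to_food": -0.5, "dist_to_scared_ghost": 0.8}
feature_fn = lambda s, a: dict(
    zip(["dist_to_food", "dist_to_scared_ghost"], s[a]))
state = {"North": (1, 4), "South": (3, 1)}  # per-action feature values
best = greedy_action(weights, feature_fn, state, ["North", "South"])
```

Here the weights would be fitted during training so that the features most predictive of the agent's success dominate the sum.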

**Vegan Pac-Man.** In the arcade game Pac-Man, an eponymous agent is situated inside a maze over a grid, where some cells contain a 'food pellet' which Pac-Man will eat if it moves inside the cell. Pac-Man's goal is to maximize his score; when Pac-Man eats a food pellet he gains a reward (+10 points), but there is also a time penalty (−1 point for every time step). Pac-Man wins when he has eaten all the food pellets in the maze (resulting in +500 points), and he loses if he collides with one of the ghost agents wandering around the maze (resulting in −500 points). However, after eating a 'power pellet' (of which there are two), the ghosts become 'scared', and Pac-Man can eat them (for +200 points).

Inspired by [14], we consider a variation of the UC Berkeley AI Pac-Man implementation [7], where Pac-Man cannot eat ghosts (only blue ghosts in the vegetarian version). Our Pac-Man agent utilizes a Q-learning policy; for the utility function we use the game's score, and we take the game states as states. We use the same game layout as in [14]; this is a 20 × 11 maze populated with 97 food pellets and two ghosts (blue and orange) which follow random paths, where the maximum score available is 2170, and 1370 when eating ghosts is forbidden.

#### **3 The Normative Supervisor**

The key component of our approach is a normative supervisor whose architecture is illustrated in Fig. 1. This module consists of a normative reasoning engine (we use the SPINdle theorem prover [11]), and of other components that encode the norms and environmental data into defeasible deontic logic rules, and translate the conclusions of the reasoning engine into instructions for the agent.

**Fig. 1.** Key components and placement of the Normative Supervisor.


We place the normative supervisor in the already-trained agent's control loop, between the localization module and the policy module. The localization module identifies the agent's current state with respect to its environment and returns a list of possible actions to the normative supervisor. The supervisor filters out all actions that are not compliant with the norms. The policy then identifies, among the pool of compliant actions, the optimal one for generating the next game state. If no compliant actions are available, the normative supervisor selects the 'lesser evil' action. This module additionally enables the logging of events during the game for later scrutiny.
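The control loop just described can be sketched as follows (a simplification under our own naming; `compliant`, `lesser_evil`, and `policy_best` stand in for the supervisor's compliance check, the 'lesser evil' fallback, and the learned policy):

```python
def choose_action(possible, compliant, lesser_evil, policy_best):
    """Supervisor filters the possible actions; the policy picks among them."""
    allowed = [a for a in possible if compliant(a)]
    if not allowed:                      # no compliant action exists
        allowed = lesser_evil(possible)  # minimally non-compliant action(s)
    return policy_best(allowed)          # optimal action among the pool

# Illustrative usage: "North" violates a norm, so the policy must choose
# among the remaining actions.
action = choose_action(
    ["North", "South"],
    compliant=lambda a: a != "North",
    lesser_evil=lambda acts: acts,     # stub fallback
    policy_best=lambda acts: acts[0],  # stub for argmax over Q-values
)
```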

#### **3.1 Configuring the Norm Base**

We start with a simple normative prescription, consisting only of the behavioral constraint proposed in [14] that "Pac-Man must not eat ghosts"<sup>2</sup>, represented as vegan ∶ **F**(eat(pacman, ghost)), where **F** denotes prohibition.

If this norm base is to inform our agent's actions, it needs to reference concepts that correspond to the information directly processed by the agent, which is limited to the locations of game entities and the actions that Pac-Man can perform, which we denote as North, South, East, West, and Stop. The only way eat(pacman, ghost) can be done is if (a) the ghost is in a 'scared' state, and (b) Pac-Man and the ghost move into the same cell. These are expressed as scared(ghost) and inRange(pacman, ghost) respectively. Pac-Man does not know which direction the ghost will move in, but we will assume a "cautious" model of action where Pac-Man is not to perform any action that could constitute eating a ghost; that is, if Pac-Man takes an action that could reasonably lead to him violating a norm, we will consider that norm violated. Since Pac-Man's next action determines what is in range, we will actually need five entities to express inRange(pacman, ghost), one corresponding to each action. These concepts are used to construct a constitutive norm, or a kind of strategy, regarding eating, strategyNorth ∶ **C**(North, eat(pacman, ghost)), which is applicable in the context {scared(ghost), inNorthRange(pacman, ghost)}.

<sup>2</sup> For the time being we generalize the blue and the orange ghosts as ghost.

For inNorthRange(pacman, ghost), we have access to the positions of Pac-Man and the ghosts, so we can create another set of constitutive norms for this, which apply in the context {pacman(i, j)}, rangeNorth ∶ **C**(ghost(k,l), inNorthRange(pacman, ghost)), where (k,l) has a Manhattan distance of one or fewer cells from (i, j + 1).
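Under our reading of rangeNorth, the 'cautious' range check amounts to the following predicate (a hypothetical helper, not the paper's code):

```python
def in_north_range(pacman, ghost):
    """A ghost at (k, l) is in north range of Pac-Man at (i, j) if its
    Manhattan distance from the cell (i, j+1) is at most one."""
    (i, j), (k, l) = pacman, ghost
    return abs(k - i) + abs(l - (j + 1)) <= 1
```

For example, a ghost at (2, 4) is in north range of Pac-Man at (2, 3), while a ghost at (5, 5) is not.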

Finally, we need to consider additional relationships between norms and concepts. For this norm base, we only have one regulative norm, so a mechanism for conflict resolution is not needed. However, as Pac-Man can only execute one action at a time, we have a non-concurrence relation between every action. This amounts to an inability to comply with multiple obligations over distinct actions. However, since Vegan Pac-Man does not deal with any obligations, additional rules will not be needed.

**Representing the Norm Base.** We need a formal language – equipped with an automated theorem prover – capable of effectively representing and reasoning with the norm base; we chose defeasible deontic (propositional) logic (DDPL for short) [8]. DDPL is defined over literals and modal literals, and the key ingredient is the rules we can construct from them. For the purposes of this paper we only consider one deontic modality (obligation **O**) and define prohibition and permission as **F**(p) ≡ **O**(¬p) and **P**(p) ≡ ¬**O**(¬p).

**Definition 1.** A rule is an expression r: A(r) ↪<sub>∗</sub> N(r), where r is a label uniquely identifying the rule, A(r) = {a<sub>1</sub>, ..., a<sub>n</sub>} is the antecedent, N(r) is the consequent, ↪<sub>∗</sub> ∈ {→<sub>∗</sub>, ⇒<sub>∗</sub>, ⇝<sub>∗</sub>}, and the mode of each rule is designated with ∗ ∈ {C, O}.

Rules labelled by C and O are constitutive and regulative rules, respectively. Strict rules (→<sub>∗</sub>) are rules where the consequent strictly follows from the antecedent without exception. Defeasible rules (⇒<sub>∗</sub>) are rules where the consequent typically follows from the antecedent, unless there is evidence to the contrary. Defeaters (⇝<sub>∗</sub>) are rules that only prevent a conclusion from being reached by a defeasible rule; regulative defeaters are used to encode permissive rules (see [8]).
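A minimal encoding of these rule shapes might look like the following (our own illustration; SPINdle's actual rule API differs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    label: str         # unique rule label r
    antecedent: tuple  # literals a1, ..., an
    consequent: str    # N(r)
    kind: str          # "strict" | "defeasible" | "defeater"
    mode: str          # "C" (constitutive) | "O" (regulative)

# The vegan prohibition as a defeasible regulative rule with an empty
# antecedent, prohibiting eat(pacman, ghost).
vegan = Rule("vegan", (), "~eat(pacman, ghost)", "defeasible", "O")
```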

The central concept of DDPL (and our application of it) is:

**Definition 2.** A defeasible theory D is a tuple ⟨F, R<sub>O</sub>, R<sub>C</sub>, >⟩, where F is a set of literals (facts), R<sub>O</sub> and R<sub>C</sub> are sets of regulative and constitutive rules, and > is a superiority relation over rules.

These tools will be utilized to map Pac-Man's situation to a defeasible theory: the environment is translated to a set of facts and the norm base to a set of rules.

#### **3.2 Automating Translation**

We are now dealing with three kinds of syntax: our informal representation of the norm base, the input and output of the host process, and the formal language of the reasoner (DDPL and its theorem prover SPINdle [11]). If we frame the reasoner as a central reasoning facility, the agent as a front-end, and the norm base as a back-end, we can implement this dynamic as a translator with two faces, one front-facing and one back-facing, feeding information into the reasoner from the agent and the norm base respectively.

**Front End Translation.** The front-end translator will be continuously in use, sending new data to be translated and requiring translated proposed actions as the environment changes. This is an algorithm that transforms input data from the agent into propositions asserting facts about the agent or the environment, and then transforms logical conclusions into instructions the agent will understand. Each cell of the Pac-Man grid can contain characters (Pac-Man or one of the ghosts), an object (a wall or a food pellet), or nothing at all. Walls are accounted for during the localization stage of Pac-Man's algorithm and food pellets are not an entity that appears in the norm base, so we need to reason only about the characters. Hence we have two sets of variables in each game: pacman<sub>i,j</sub> and ghost<sub>i,j</sub> (along with scared(ghost) if the ghost is in a scared state) assert the current coordinates of Pac-Man and of each ghost, and appear in a set Facts in the defeasible theory GameState = ⟨Facts, R<sub>C</sub>, R<sub>O</sub>, >⟩.

Actions will be represented as deontic literals, in the set

Actions = {North, South, East, West, Stop}

A query from Pac-Man to the reasoner will be accompanied by a representation of the current game state, along with a list of possible actions, possible, which will be translated to the corresponding literal in Actions.
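A sketch of this front-end translation step (function and fact names are our assumptions, not the paper's code):

```python
def translate_state(pacman_pos, ghosts, possible):
    """Turn raw positions into facts and filter the proposed actions.

    `ghosts` is a list of ((k, l), is_scared) pairs.
    """
    facts = ["pacman_%d_%d" % pacman_pos]
    for (k, l), is_scared in ghosts:
        facts.append("ghost_%d_%d" % (k, l))
        if is_scared:
            facts.append("scared_ghost")
    actions = [a for a in possible
               if a in {"North", "South", "East", "West", "Stop"}]
    return facts, actions

# Illustrative query: Pac-Man at (2, 3), one scared ghost at (2, 4).
facts, actions = translate_state((2, 3), [((2, 4), True)], ["North", "Stop"])
```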

**Back End Translation.** In this critical task it is crucial to ensure that norms dictate the same behaviour once translated into this language. Besides making sure that each component of the norm can be represented by the language, we must also analyse our translated norm base with respect to how the available metadata is accommodated by the reasoner's rules of inference.

We represent the regulative norm of Vegan Pac-Man (vegan) as:

$$
\Rightarrow_O \neg eat_{pacman,ghost} \in R_O
$$

where defeasibility is given as a precautionary measure, in case we want to add (potentially conflicting) norms later.

Note that if moving North counts as eating a ghost, an obligation to go North counts as being obligated to eat a ghost, and a prohibition to eat a ghost implies a prohibition to move North. So we can rewrite strategyNorth as **C**(**O**(¬eat(pacman, ghost)), **O**(¬North)), or with the applicable context as:

$$
scared_{ghost},\; inNorthRange_{pacman,ghost},\; \mathbf{O}(\neg eat_{pacman,ghost}) \Rightarrow_O \neg North \in R_O
$$

Note that though this is a constitutive rule, in DDPL it will be in R<sub>O</sub>. This will work for all of the constitutive norms attached to a prohibited action, where we place the context and the prohibition in question in the antecedent, and the prohibition of the concrete action in the strategy is the consequent.

For the remaining constitutive norms, we have a rather simple conversion. These norms will be generated w.r.t. the input from the agent; for example, if the agent (Pac-Man) tells us that he is at (2, 3), the rule rangeNorth will be:

$$
pacman_{2,3},\; ghost_{2,4} \rightarrow_C inNorthRange_{pacman,ghost} \in R_C
$$

We have found that it is more time-efficient to generate these constitutive norms anew whenever the fact set changes, instead of generating every possible constitutive norm ahead of time and having SPINdle deal with them all at once.

#### **3.3 Classify and Assess Conclusions**

Once we understand how various concepts are represented in the reasoner language, we need to parse the possible outputs of the reasoning engine into indicators as to which actions in the agent's arsenal are compliant with the norm base.

**Compliant Solutions.** Ideally, we will want to locate a compliant solution – an action that constitutes a possible course of action for the agent that does not violate any norms – from the conclusions yielded by the reasoner.

**Definition 3.** A set of compliant solutions is: (1) non-empty, and consisting only of (2) solutions composed of possible actions, (3) solutions that do not violate any norms, and (4) solutions that are internally consistent.

The manner in which we construct such a set is heavily influenced by the output (conclusions) yielded by SPINdle. Conclusions in DDPL are established over proofs and can be classified as defeasible or definite, and positive or negative. A positive conclusion means that the referenced literal holds, while a negative one indicates that this literal has been refuted. A definite conclusion is obtained using only strict rules and facts, via forward chaining of rules. A conclusion holds defeasibly (denoted by +∂<sub>C</sub> for a factual conclusion and +∂<sub>O</sub> for an obligation) if there is an applicable rule for it and the rules for the opposite cannot be applied or are defeated. Over the course of a proof, each rule will be classified as either applicable (i.e., the antecedent holds and the consequent follows), discarded (i.e., the rule is not applied because the antecedent does not fully hold), or defeated by a defeater or a higher-priority rule. For a set of rules R, R[p], R<sub>O</sub> and R<sub>sd</sub> are, respectively, the subsets of rules for p, regulative rules, and strict or defeasible rules. The definition of provability for defeasible obligations [8] (we define only defeasible conclusions, because in our formalization regulative norms were expressed as defeasible rules) is:

**Definition 4.** Given a defeasible theory D, if D ⊢ +∂<sup>O</sup> p, then:


A derivation in DDPL has a three phase argumentation structure, where arguments are simply applicable rules: (1) we need an argument for the conclusion we want to prove, (2) we analyse all possible counter-arguments, and (3) we rebut the counter-arguments. An argument can be rebutted when it is not applicable or when it is defeated by a stronger applicable argument. If we exclude the undercut case, in every phase the arguments attack the arguments in the previous phase. A rule attacks another rule if the conclusions of the two rules are contradictory (note that **P**(q) and **P**(¬q) are not a deontic contradiction). Accordingly, any regulative rule for q attacks a strict or defeasible regulative rule for ¬q. However, a regulative defeater for q is not attacked by a regulative defeater for ¬q (condition 2(c) above).

We parse out a solution set by: (1) if we do not receive a full set of conclusions from SPINdle, we return an empty set; (2) we remove all conclusions that do not reference a literal in possible; (3) any action corresponding to a defeasibly proved positive literal occurs in every solution; and (4) any action corresponding to a defeasibly proved negative literal is discarded from every solution.
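Steps (1)–(4) can be sketched as follows (our simplification: `conclusions` maps each action literal to "+" for a defeasibly proved positive literal or "-" for a defeasibly proved negative one):

```python
def parse_solutions(conclusions, possible, complete):
    if not complete:                                # (1) incomplete output
        return []
    relevant = {a: s for a, s in conclusions.items()
                if a in possible}                   # (2) drop other literals
    obliged = [a for a in possible
               if relevant.get(a) == "+"]           # (3) must occur
    if obliged:
        return obliged
    return [a for a in possible
            if relevant.get(a) != "-"]              # (4) discard prohibited
```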

**Claim:** the above procedure yields either an empty set or a compliant solution. Proof sketch: If our solution is not internally consistent, we can prove both +∂<sub>O</sub> a and +∂<sub>O</sub> ¬a for some action a. In this case SPINdle will return neither, and the above procedure leads to an empty set in step (1). Only possible actions will occur in a solution as per step (2), and any solutions which fail to comply with an obligation or prohibition will be excluded through steps (3) and (4) respectively.

**'Lesser of two Evils' Solutions.** If the above procedure leaves us with an empty solution set, we want to identify which non-compliant actions constitute the "best" choice (i.e., are minimally non-compliant). Our characterization of degrees of non-compliance depends on the way the reasoner constructs solutions, and what information it logs during this process. SPINdle has an inference logger that classifies every rule in the theory as discarded, applicable, or defeated. For our agent, the chosen degree is a score derived from the number of norms that have been applied versus those that have been defeated (discarded norms are ignored):

score ∶= #complied − #violated = #applied − #defeated

This score is computed through the theory GameState<sub>a</sub>, which is constructed by adding a fact **O**(a) to GameState. Recall that a rule will be defeated when its defeasible theory includes a fact that conflicts with the head of this rule. So when we add **O**(a) to GameState, all norms that prescribed **F**(a) = **O**(¬a) for GameState are defeated, and any prescribing **O**(a) are applied. To compute the score, we use SPINdle in a rather unconventional way: we ignore the conclusions yielded and check the inference log to count which rules have been applied during reasoning (#applied) and which were defeated (#defeated), setting score = #applied − #defeated. This procedure is completed for every action in possible, and we select the action(s) with the highest score. If there are multiple actions with the highest score, we send multiple solutions to the agent and it picks the best action according to its policy.
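The scoring loop can be sketched as follows (with `run_with_obligation` a stand-in of our own for constructing the theory with the added fact **O**(a) and reading the applied/defeated counts from SPINdle's inference log):

```python
def lesser_evil(possible, run_with_obligation):
    """Return the minimally non-compliant action(s) by score."""
    scores = {}
    for a in possible:
        applied, defeated = run_with_obligation(a)  # counts from the log
        scores[a] = applied - defeated              # score = #applied - #defeated
    best = max(scores.values())
    return [a for a in possible if scores[a] == best]

# Illustrative usage with stubbed log counts per action.
counts = {"North": (3, 1), "South": (1, 2)}
choice = lesser_evil(["North", "South"], lambda a: counts[a])
```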

**Claim:** computing scores for all possible actions is completed in polynomial time. Proof sketch: As shown in [8], conclusions in DDPL can be computed in linear time with respect to the number of literal occurrences plus the number of rules in the theory. The claim holds since every action in possible is a literal, and the above procedure is completed ∣possible∣ times.

#### **3.4 Revising the Norm Base**

We demonstrate the advantages of our approach – modularity, configurability, and capability as an event recorder – through revising our norm base.

Inherent to Pac-Man's environment is the possibility of encountering a state where no compliant action is possible; in this section we explore how to address such cases by adding rules to, or removing rules from, the norm base.

When playing "vegan" Pac-Man, we may encounter the case depicted in Fig. 2(a). In absence of additional information Pac-Man will eat whichever ghost

**Fig. 2.** Pac-Man trapped between two ghosts (a) or in a corner (b). In (c) Pac-Man consumes the power pellet and eats the ghost at the same time.

the policy indicates it should, and a violation report is generated. Each violation report is saved as a timestamped file accompanied by a representation of the current game state. This report can be used to retroactively examine the context in which violations occur, and we can thereby revise our norm base, which is independent of the agent's RL policy. In the case of "vegan" Pac-Man, these reports make it clear that this version of the game is susceptible to somewhat regular violations of the form depicted in Fig. 2(a).

If we consider instead "vegetarian" Pac-Man, we can restrict our norm base to the vegan rule only applied to the blue ghost. However, situations in which compliance is not possible can still occur; for instance the one depicted in Fig. 2(b), or the case where Pac-Man consumes a power pellet and the blue ghost at the same time, as shown in Fig. 2(c). In the latter case, the violation occurs because, prior to Pac-Man's consumption of the power pellet, the blue ghost is not scared and Pac-Man's strategy to comply with vegan will not be triggered. This is roughly analogous to an agent committing an unethical act because it has no way of recognizing that it is unethical.

In summary, the violation reports show that there are four points in the maze where Pac-Man potentially cannot comply, given the information he has access to; in response, we add a norm danger steering Pac-Man away from these areas:

$$
\Rightarrow_O \neg enter_{pacman,danger}
$$

which is accompanied by constitutive norms defining the abstract action of "entering danger" (for some pre-defined location denoted as danger), such as:

$$
inNorthRange_{pacman,danger},\; inRange_{ghost,danger} \Rightarrow_O \neg North
$$

#### **4 Evaluation and Conclusion**

We have presented a modular and transparent approach that enables an autonomous agent to pursue ethical goals while still running an RL policy that maximizes its cumulative reward. Our approach was evaluated on six tests<sup>3</sup>, in batches of 100 games. The results are displayed in the following table and discussed below; we give data on both game performance (average score and % of games won) and ethical performance (ghosts eaten). Refer to Sect. 2 for a thorough description of the testing environment.

The first two baseline tests measured the performance of Pac-Man using two different (ethically agnostic) RL policies without the normative supervisor; this establishes a baseline for Pac-Man's game performance. We refer to the first

<sup>3</sup> We use a laptop with Intel i5-8250U CPU (4 cores, 1.60 GHz) and 8GB RAM, running Ubuntu 18.04, Java 8, Python 2.7.


RL policy (in Test 1a) as safe because the algorithm used to train it does not differentiate between regular ghosts and scared ghosts, learning how to avoid them altogether. We refer to the other RL policy (in Test 1b) as hungry because the corresponding algorithm differentiates between regular ghosts and scared ghosts, and the agent learns how to eat the scared ghosts. The results for Test 1b (average score of 1503.5, maximum score of 2133) were comparable to the baseline version in [14] (average score of 1675.9, maximum score of 2144).

Tests 2a, 2b, 3, and 4 make use of the normative supervisor. In 2a and 2b, we subject Pac-Man to a "vegan" norm base, prohibiting eating all ghosts (for the safe and hungry policies respectively). The results obtained for Test 2a were comparable to those in [14]: the average number of violations was the same in both tests (0.03 ghosts), and our average score was only slightly smaller (1193.39 instead of 1268.5). Compared with the baseline, game performance did not suffer. For Test 2b we instead obtained full compliance. Tests 3 and 4 both use the hungry policy. In Test 3 we subject Pac-Man to a "vegetarian" norm base, where only eating blue ghosts is forbidden. Allowing Pac-Man to eat one of the ghosts lets him further maximize his score and avoid the violations depicted in Fig. 2(a). Test 4 addresses the two edge cases of non-compliance occurring in Test 3, depicted in Fig. 2(b) and Fig. 2(c), by adding the new rules defined in Sect. 3.4, steering Pac-Man away from the "dangerous" areas. Here, violations were completely eliminated.

These tests, along with the analysis of the violation reports created in noncompliant cases, yielded several insights. The module did not cause Pac-Man's game performance to suffer, and could successfully identify non-compliant behaviour. It implemented compliant behaviour in most cases, with the exception of situations where compliance was not possible. The violation reports allowed us to identify such situations with ease.

The game used in this paper offers limited opportunities to work with meaningful (ethical) norms. We aim to explore alternative case studies with more options to define multiple (and possibly conflicting) ethical goals to test the interactions between RL and a normative supervisor based on DDPL.

# **References**


6381. International Joint Conferences on Artificial Intelligence Organization (July 2019). https://doi.org/10.24963/ijcai.2019/891


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Automatically Building Diagrams for Olympiad Geometry Problems**

Ryan Krueger<sup>1</sup>, Jesse Michael Han<sup>2</sup>, and Daniel Selsam<sup>3</sup>

<sup>1</sup> University of Oxford, Oxford, UK
<sup>2</sup> University of Pittsburgh, Pittsburgh, PA, USA
<sup>3</sup> Microsoft Research, Redmond, WA, USA

**Abstract.** We present a method for automatically building diagrams for olympiad-level geometry problems and implement our approach in a new open-source software tool, the Geometry Model Builder (GMB). Central to our method is a new domain-specific language, the Geometry Model-Building Language (GMBL), for specifying geometry problems along with additional metadata useful for building diagrams. A GMBL program specifies (1) how to parameterize geometric objects (or sets of geometric objects) and initialize these parameterized quantities, (2) which quantities to compute directly from other quantities, and (3) additional constraints to accumulate into a (differentiable) loss function. A GMBL program induces a (usually) tractable numerical optimization problem whose solutions correspond to diagrams of the original problem statement, and that we can solve reliably using gradient descent. Of the 39 geometry problems since 2000 appearing in the International Mathematical Olympiad, 36 can be expressed in our logic and our system can produce diagrams for 94% of them on average. To the best of our knowledge, our method is the first in automated geometry diagram construction to generate models for such complex problems.

## **1 Introduction**

Automated theorem provers for Euclidean geometry often use numerical models (i.e. diagrams) for heuristic reasoning, e.g. for conjecturing subgoals, pruning branches, checking non-degeneracy conditions, and selecting auxiliary constructions. However, modern solvers rely on diagrams that are either supplied manually [7,24] or generated automatically via methods that are severely limited in scope [12]. Motivated by the IMO Grand Challenge, an ongoing effort to build an AI that can win a gold medal at the International Mathematical Olympiad (IMO), we present a method for expressing and solving olympiad-level systems of geometric constraints.

Historically, algebraic methods are the most complete and performant for automated geometry diagram construction but suffer from degenerate solutions

<sup>©</sup> The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 577–588, 2021. https://doi.org/10.1007/978-3-030-79876-5_33

Fig. 1: An example GMBL program and corresponding diagram generated by the GMB for IMO 2010 Problem 2.

and, in the numerical case, non-convexity. These methods are restricted to relatively simple geometric configurations, as large numbers of parameters give rise to poor local minima. Moreover, degenerate solutions manifest as poor distributions for the vertices of geometric objects (e.g. a nonsensical triangle) as well as intersections of objects at more than one point (e.g. lines and circles, circles and circles).

We constructed a domain-specific language (DSL), the Geometry Model-Building Language (GMBL), for expressing geometry problems whose semantics induce tractable numerical optimization problems. The GMBL includes a set of commands with which users introduce geometric objects and constraints between these objects. There is a direct interpretation from these commands to the parameterization of geometric objects, the computation of geometric quantities from existing ones, and additional numerical constraints. The GMBL employs root selector declarations to disambiguate problems with multiple solutions, reparameterizations both to reduce the number of parameters and to increase uniformity in model variance, and joint distributions for geometric objects that are susceptible to degeneracy (i.e. triangles and polygons). Our DSL treats points, lines, and circles as first-class citizens, and the language can be easily extended to support additional high-level features in terms of these primitives.

We provide an implementation of our method, the Geometry Model Builder (GMB), that compiles GMBL programs into Tensorflow computation graphs [1] and generates models via off-the-shelf, gradient-based optimization. Figure 2 demonstrates an overview of this implementation. Experimentally, we find that the GMBL sufficiently reduces the parameter space and mitigates degeneracy to make our target geometry amenable to numerical optimization. We tested our method on all IMO geometry problems since 2000 (n = 39), of which 36 can be expressed as GMBL programs. Using default parameters, the GMB finds a single model for 94% of these 36 problems in an average of 27.07 seconds. Of the problems for which our program found a model and the goal of the problem could be stated in our DSL, the goal held in the final model 86% of the time.
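To make this pipeline concrete, the following minimal sketch (not the GMB's actual code, which compiles to Tensorflow) solves a toy constraint system by plain gradient descent with numerically estimated gradients: find a point P on a given circle that is equidistant from two fixed points. The specific constraints, learning rate, and iteration count are all illustrative.

```python
import numpy as np

O, R = np.array([0.0, 0.0]), 2.0              # circle (center, radius)
A, B = np.array([3.0, 0.0]), np.array([0.0, 3.0])

def loss(p):
    # soft constraints: P lies on circle (O, R) and |PA| = |PB|
    on_circle = (np.linalg.norm(p - O) - R) ** 2
    equidistant = (np.linalg.norm(p - A) - np.linalg.norm(p - B)) ** 2
    return on_circle + equidistant

def num_grad(f, p, h=1e-6):
    # central finite differences stand in for autodiff here
    g = np.zeros_like(p)
    for i in range(len(p)):
        d = np.zeros_like(p)
        d[i] = h
        g[i] = (f(p + d) - f(p - d)) / (2 * h)
    return g

p = np.random.default_rng(0).normal(size=2)   # random initialization
for _ in range(2000):                         # plain gradient descent
    p -= 0.1 * num_grad(loss, p)
```

At convergence the loss is numerically zero and P sits on the circle and on the perpendicular bisector of AB, i.e. a diagram of the constraint system.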

All code is available on GitHub<sup>4</sup>, with which users can write GMBL programs and generate diagrams. Our program can be run either as a command-line tool, for integration with theorem provers, or as a locally-hosted web server.

## **2 Background**

Here we provide an overview of olympiad-level geometry problem statements, as well as several challenges presented by the associated constraint problems.

#### **2.1 Olympiad-Level Geometry Problem Statements**

IMO geometry problems are stated as a sequential introduction of potentially constrained geometric objects, as well as additional constraints between entities. Such constraints can take one of two forms: (1) geometric constraints describe the relative position of geometric entities (e.g. two lines are parallel), while (2) dimensional constraints enforce specific numerical values (e.g. an angle, a radius). Lastly, problems end with a goal (or set of goals), typically in the form of geometric or dimensional constraints. The following is an example from IMO 2009:

Let ABC be a triangle with circumcentre O. The points P and Q are interior points of the sides CA and AB, respectively. Let K, L, and M be the midpoints of the segments BP, CQ, and PQ, respectively, and let Γ be the circle passing through K, L, and M. Suppose that the line PQ is tangent to the circle Γ. Prove that OP = OQ.

(IMO 2009 P2)

This problem introduces ten named geometric objects and has a single goal.

Note that this class of problems does not admit a mathematical description but rather is defined empirically (i.e. as those problems selected for olympiads). The overwhelming majority of these problems are of a particular type: plane geometry problems that can be expressed as problems in nonlinear real arithmetic (NRA). However, while NRA is technically decidable, olympiad problems tend to be littered with order constraints and complex constructions (e.g. the mixtilinear incenter) and lie well beyond the capability of existing algebraic methods. On the other hand, they are selected to admit elegant, human-comprehensible proofs. It is this class of problems that the GMBL was designed to express; a particular olympiad geometry problem is not guaranteed to be of this type, however, and therefore (though rarely) may not be expressible in the GMBL.

#### **2.2 Challenge: Globally Coupled Constraints**

A naïve approach to generating models would incrementally instantiate objects via their immediate constraints. For (IMO 2009 P2), this would work as follows:

1. Sample points A, B, and C.

<sup>4</sup> https://github.com/rkruegs123/geo-model-builder

Fig. 2: An overview of our method. Our program takes as input a GMBL program and translates it to a set of real-valued parameters and differentiable losses in the form of a static computation graph. We then apply gradient-based optimization to obtain numerical models and display them as diagrams.


Immediately we see a problem: there is no guarantee that PQ is tangent to Γ in the final model. Indeed, the constraints of (IMO 2009 P2) are quite globally coupled: the choice of P partially determines the circle Γ to which PQ must be tangent, and not every choice of ΔABC even admits a pair P and Q satisfying this constraint. This is an example of the frequently non-constructive nature of IMO geometry problems. When there is no obvious reparameterization to avoid downstream effects, all constraints must be considered simultaneously rather than incrementally or as a set of smaller local optimization problems.

#### **2.3 Challenge: Root Resolution**

Even in the constructive case, local optimization is not necessarily sufficient, given that algebraic constraints can have multiple solutions. More specifically, two circles, or a circle and a line, intersect at up to two distinct points, and in a problem that specifies each distinct intersection point, the correct root to assign is generally not locally deducible. Without global information, this can lead to poor initializations that become trapped in local minima. The GMBL accounts for this by including a set of explicit root selectors, as described in Section 3.3. These root selectors provide global information for selecting the appropriate point from a set of multiple solutions to a system of equations.
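To illustrate the underlying geometry (an assumed sketch; only the name inter-lc is borrowed from the paper's function names), the two candidate roots of a line-circle intersection come from a quadratic, and a "root selector" is then any policy for picking one of them, e.g. the root closest to a reference point:

```python
import numpy as np

def inter_lc(a, b, o, r):
    """Both intersection points of line AB with circle (O, r), if any."""
    d = b - a                    # direction of the line
    f = a - o
    qa = d @ d                   # coefficients of |A + t*d - O|^2 = r^2
    qb = 2 * (f @ d)
    qc = f @ f - r * r
    disc = qb * qb - 4 * qa * qc
    if disc < 0:
        return []                # line misses the circle
    ts = [(-qb - np.sqrt(disc)) / (2 * qa), (-qb + np.sqrt(disc)) / (2 * qa)]
    return [a + t * d for t in ts]

def root_closest_to(candidates, ref):
    # one illustrative root-selector policy: nearest to a reference point
    return min(candidates, key=lambda p: np.linalg.norm(p - ref))

# line y = 1 meets the radius-2 circle at (-sqrt(3), 1) and (sqrt(3), 1)
pts = inter_lc(np.array([-3.0, 1.0]), np.array([3.0, 1.0]),
               np.array([0.0, 0.0]), 2.0)
sel = root_closest_to(pts, np.array([2.0, 2.0]))
```

Without such a selector, an optimizer has no principled way to commit to one of the two algebraically indistinguishable roots.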

## **3 Methods**

In this section we present the GMBL and GMB in detail. In our presentation, we make use of the following notation and definitions:


#### **3.1 GMBL: Overview**

The GMBL is a DSL for expressing olympiad-level geometry problems that losslessly induces a numerical optimization problem. It consists of four commands, each of which has a direct interpretation regarding the accumulation of (1) real-valued parameters and (2) differentiable losses in terms of these parameters:


Table 1 provides a summary of their usage. The GMBL includes an extensible library of functions and predicates with which commands are written. Notably, this library includes a notion of root selection to explicitly resolve the selection of roots to systems of equations with multiple solutions.

#### **3.2 GMBL: Commands**

In the following, we describe in more detail the usage of each command and their roles in constructing a tractable numerical optimization problem.

param accepts as arguments a string, a type, and an optional parameterization. This introduces a geometric object that is parameterized either by the default parameterization for <type> or by the supplied method. Each primitive geometric type has the following default parameterization:


Optional parameterizations embody our method's use of reparameterization to decrease the number of parameters and increase model diversity. For example, consider a point C on the line AB that is subject to additional constraints. Rather than optimizing over the x- and y-coordinates of C, we can express C in terms of a single value z that scales C's placement on the line AB.
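A minimal sketch of this reparameterization (illustrative code, not the GMB's):

```python
import numpy as np

def on_line(a, b, z):
    """One-parameter representation of a point on line AB:
    z = 0 gives A, z = 1 gives B, other values extrapolate."""
    return a + z * (b - a)

a, b = np.array([0.0, 0.0]), np.array([4.0, 2.0])
c = on_line(a, b, 0.25)   # optimize over the scalar z, not over (x, y)
```

Any value of z keeps C exactly on line AB, so the collinearity constraint never has to appear as a loss term.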

In addition to the standard usage of param outlined above, the GMBL includes an important variant of this command to introduce sets of points that form triangles and polygons. This variant accepts as arguments (1) a list of point names, and (2) a required parameterization (see Table 1). This joint parameterization of triangles and polygons further prevents degeneracy. For example, to initialize a triangle ΔABC, we can sample the vertices from normal distributions with means at distinct thirds of the unit circle. This method minimizes the sampling of triangles with extreme angle values and allows explicit control over the distribution of acute vs. obtuse triangles by adjusting the standard deviations. Appendix C includes a list of all available parameterizations.<sup>5</sup>

<sup>5</sup> All appendices can be found in the long version of this paper [15].
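The joint triangle parameterization described above can be sketched as follows (the means and standard deviation are assumptions for illustration; the GMB's exact distributions may differ):

```python
import numpy as np

def sample_triangle(rng, std=0.3):
    """Sample triangle vertices near distinct thirds of the unit circle.

    With std = 0 the triangle is equilateral; increasing std admits
    more irregular (and more often obtuse) triangles.
    """
    angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
    means = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (3, 2)
    return means + rng.normal(scale=std, size=(3, 2))

tri = sample_triangle(np.random.default_rng(0))
```

Because the three means are well separated, near-degenerate (almost collinear) triangles are rare under this sampler.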


Table 1: An overview of usage for the four commands.

define accepts as arguments a string, a type, and a value that is one of <point>, <line>, or <circle>. This command serves as a basic assignment operator and is useful for caching commonly used values. The functions described in Section 3.3 are used to construct <value> from existing geometric objects.

assert accepts a single predicate and imposes it as an additional constraint on the system. This is achieved by translating the predicate to a set of algebraic values and registering them as losses. This command does not introduce any new geometric objects and can only refer to those already introduced by param or define. Notably, dimensional constraints and negations are always enforced via assert. Detail on supported predicates is presented in Section 3.3.
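For instance, a predicate such as "AB is parallel to CD" can be translated to a differentiable penalty that vanishes exactly when the predicate holds (an illustrative translation; the GMB's actual loss definitions may differ):

```python
import numpy as np

def loss_parallel(a, b, c, d):
    # 2-D cross product of the direction vectors, squared: a smooth,
    # non-negative loss that is zero iff AB is parallel to CD
    u, v = b - a, d - c
    cross = u[0] * v[1] - u[1] * v[0]
    return cross ** 2

a, b = np.array([0.0, 0.0]), np.array([2.0, 1.0])
c, d = np.array([1.0, 5.0]), np.array([5.0, 7.0])
# directions (2, 1) and (4, 2) are parallel, so this loss is zero
```

Registering such terms as losses is what lets assert piggyback on the same gradient-based machinery as the rest of the system.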

eval, like assert, accepts a single predicate and therefore does not introduce any new geometric objects. However, unlike assert, the corresponding algebraic values are evaluated and returned with the final model rather than registered as losses and enforced via optimization. This command is most useful for those interested in integrating the GMBL with theorem provers.

#### **3.3 GMBL: Functions and Predicates**

The second component of our DSL is a set of functions and predicates for constructing arguments to the commands outlined above. Functions construct new geometric objects and numerical values whereas predicates describe relationships between them. Our DSL includes high-level abstractions for common geometric concepts in olympiad geometry (e.g. excircle, isotomic conjugate).

Functions in the GMBL employ a notion of root selectors to address the "multiple solutions problem" described in Section 2.3. In plane geometry, this problem typically manifests with multiple candidate point solutions, such as the intersection between a line and a circle. Root selectors control for this by allowing users to specify the appropriate point for functions with multiple solutions. Figure 3 demonstrates their usage in the functions inter-lc (intersection of a line and circle) and inter-cc (intersection of two circles).

Importantly, arguments to predicates and functions can be specified with functions rather than named geometric objects. For a list of supported functions, predicates, and root selectors, refer to Appendices A, B, and C, respectively.

Fig. 3: An example usage of root selectors to resolve the intersections of lines and circles, and circles and circles.

#### **3.4 Auxiliary Losses**

The optimization problem encoded by a GMBL program includes three additional loss values. Foremost, for every instance of a circle intersecting a line or another circle, we impose a loss value that ensures the two geometric objects indeed intersect. The final two losses, albeit opposing ones, are intended to minimize global degeneracy. We impose one loss that minimizes the mean of all point norms, to prevent excessively dispersed objects, and a second that enforces a sufficient distance between points to maintain distinctness.
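The two opposing degeneracy losses can be sketched as follows (a hedged illustration; the weights and distance threshold used by the GMB are not specified here):

```python
import numpy as np

def centering_loss(points):
    # mean of all point norms: discourages excessively dispersed objects
    return float(np.mean(np.linalg.norm(points, axis=1)))

def distinctness_loss(points, min_dist=0.1):
    # hinge penalty whenever two points come closer than min_dist
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            gap = np.linalg.norm(points[i] - points[j])
            total += max(0.0, min_dist - gap) ** 2
    return total

pts = np.array([[0.0, 0.0], [0.05, 0.0], [1.0, 1.0]])
# the first two points are too close, so distinctness_loss(pts) > 0
```

The hinge form means the distinctness term is exactly zero once all points are sufficiently separated, so the two losses stop fighting at a well-spread configuration.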

#### **3.5 Implementation**

We built the GMB, an open-source implementation that compiles GMBL programs to optimization problems and generates models. The GMB takes as input a GMBL program and processes each command in sequence to accumulate real-valued parameters and differentiable losses in a Tensorflow computation graph. After registering auxiliary losses, we apply off-the-shelf gradient-based local optimization to produce models of the constraint system. In summary, to generate N numerical models, our optimization procedure works as follows:


Our program accepts as arguments (1) the # of models desired (default = 1), (2) the # of initializations to sample (default = 10), and (3) the max # of optimization tries (default = 3). Our program also accepts the standard suite of parameters for training a Tensorflow model, including an initial learning rate (default = 0.1), a decay rate (default = 0.7), the max # of iterations (default = 5000), and an epsilon value (default = 0.001) to determine stopping criteria.
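The search loop over initializations and retries can be sketched as follows; sample_init and optimize are stand-ins (here a toy quadratic loss) for sampling a GMBL initialization and running the Tensorflow optimizer:

```python
import numpy as np

TARGET = np.array([1.0, -2.0])

def sample_init(rng):
    return rng.normal(size=2)

def optimize(p0, steps=500, lr=0.1):
    # stand-in optimizer: gradient descent on a toy quadratic loss
    p = p0.copy()
    for _ in range(steps):
        p -= lr * 2 * (p - TARGET)
    return p, float(np.sum((p - TARGET) ** 2))

def solve(n_models=1, n_inits=10, n_tries=3, tol=1e-3, seed=0):
    """Mirror of the GMB search loop: sample initializations, retry
    optimization, and stop once enough models have been found."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_inits):
        init = sample_init(rng)
        for _ in range(n_tries):
            params, final_loss = optimize(init)
            if final_loss < tol:        # model satisfies constraints
                models.append(params)
                break
        if len(models) >= n_models:     # early exit on success
            break
    return models

print(len(solve()))   # -> 1
```

The early exit explains the reported gap between time to success and time to failure: failure requires exhausting every initialization and retry.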

Table 2: An evaluation of our method's ability to generate a single model for each of the 36 IMO problems encoded in our DSL. For each problem, 10 sets of initial parameters were sampled, over which our program optimized up to three times. All data shown are the average of three trials. The first row demonstrates results using default parameters (ε = 0.001, learning rate = 0.1, # iterations = 5,000).


## **4 Results**

In this section, we present an evaluation of our method's proficiency in three areas of expressing and solving olympiad-level geometry problems:


Table 2 contains a summary of our results.

Our evaluation considers all 39 IMO geometry problems since 2000. Of these 39 problems, 36 can be expressed in our DSL. Those that we cannot encode involve variable numbers of geometric objects. For 32 of these 36 problems, we can express the goals as eval commands in the corresponding GMBL programs. The goals of the remaining four problems are not expressible in our DSL, e.g. our DSL cannot express goals of the form "Find all possible values of ∠ABC."

To evaluate (2) and (3), we conducted three trials in which we ran our program on each of the 36 encodings with varying sets of arguments. With default arguments, our program generated a single model for (on average) 94% of these problems. Our program ran for an average of 27.07 seconds for each problem but there is a stark difference between time to success and time to failure (14.72 vs 223.51 seconds) as failure entails completing all optimization attempts whereas successful generation of a model terminates the program. We achieve similar success rates with more forgiving training arguments or a higher tolerance.

For use in automated theorem proving, it is essential that models generated by our tool not only satisfy the constraint problem up to tolerance but also satisfy any other truths that follow from the set of input constraints. The most immediate example of such a truth is the goal of a problem statement. Therefore, we used the goals of IMO geometry problems as a proxy for this ability by only checking the satisfaction of the goal in the final model (i.e. with an eval statement) rather than directly optimizing for it. In our experiments, we considered such a goal satisfied if it held up to 10 ∗ ε, as it is reasonable to expect slightly higher floating-point error without explicit optimization. Using default parameters, the goal held up to tolerance in 86% of problems for which we found a model and could express the goal. This rate was similar across all other sets of arguments.

#### **5 Future Work**

Here we discuss various opportunities for improvement of our method.

Firstly, improvements could be made to our method of numerical optimization. While Tensorflow offers a convenient way of caching terms via a static computation graph and optimizing directly over this representation, there is no explicit support for constrained optimization. Because of this, arbitrary weights have to be assigned to each loss value. Though rare, this can result in false positives and negatives for the satisfaction of a constraint. Using an explicit constrained-optimization method (e.g. SLSQP) would enable the separation of soft constraints (e.g. maximizing the distance between points) from hard constraints (e.g. those enforced by assert), removing the need for arbitrary weights.

Secondly, cognitive overhead could be reduced as users are currently required to determine degrees of freedom; it would be far easier to write problem statements using only declarations of geometric objects and constraints between them, e.g. using only assert. This could be accomplished by treating our DSL as a low-level "instruction set" to which a higher-level language could be compiled. The main challenge of such a compiler would be appropriately identifying opportunities to reduce the degrees of freedom. To achieve this, the compiler would require a decision procedure for line and circle membership.

Lastly, we could improve our current treatment of distinctness. To prevent degenerate solutions, our method optimizes for object distinctness and rejects models with duplicates. However, there is the occasional problem for which a local optimum encodes two provably distinct points as equal up to floating point tolerance. There are many techniques that could be applied to this problem (e.g. annealing) though we do not consider them here as the issue is rare.

#### **6 Related Work**

Though many techniques for mechanized geometry diagram construction have been introduced over the decades, no method, to the best of our knowledge, can produce models for more than a negligible fraction of olympiad problems. There exist many systems, built primarily for educational purposes, for interactively generating diagrams using ruler-and-compass constructions, e.g. GCLC [13], GeoGebra [11], Geometer's Sketchpad [20], and Cinderella [19]. There are also noninteractive methods for deriving such constructions, e.g. GeoView [2] and program synthesis [9,12]. However, as discussed in Section 2.2, very few olympiad problems can be described in such a form. Alternatively, Penrose is an early-stage system for translating mathematical descriptions to diagrams that relies on constrained numerical optimization and therefore does not suffer from this expressivity limitation [25]. However, this system lacks support for constraints with multiple roots, e.g. intersecting circles. There are more classical methods that similarly depart from constructive geometry. MMP/Geometer [8] translates the problem to a set of algebraic equations and uses numerical optimization (e.g. BFGS), while GEOTHER [22,23] first translates a predicate specification into polynomial equations, decomposes this system into representative triangular sets, and obtains solutions for each set numerically. Neither of these programs is available to evaluate, though we did test similar approaches using modern libraries (specifically sympy [17] and scipy [21]), and both numerical and symbolic methods would almost always time out on relatively simple olympiad problems.

Generating models for systems of geometric constraints is also a challenge in computer-aided design (CAD) for engineering diagram drawing. Recent efforts focus on graph-based synthetic methods, a subset of techniques concerned with ruler-and-compass constructions [3,5,6,10,14,16,18]. Most relevant to our method are Bettig and Shah's "solution selectors" which, similar to root selectors in the GMBL, allow users to specify the configuration of a CAD model [4]. However, these solution selectors are purpose-built and do not generalize.

# **7 Conclusion**

It is standard in geometry theorem proving (GTP) to rely on diagrams for heuristic reasoning, but the scale of automatic diagram construction has been limited. To enable efforts to build a solver for IMO geometry problems, we developed a method for building diagrams for olympiad-level geometry problems. Our method is based on the GMBL, a DSL for expressing geometry problems that induces (usually) tractable numerical optimization problems. The GMBL includes a set of commands that have a direct interpretation for accumulating real-valued parameters and differentiable losses. Arguments to these commands are constructed with a library of functions and predicates that includes notions of root selection, joint distributions, and reparameterizations to minimize degeneracy and the number of parameters. We implemented our approach in an open-source tool that translates GMBL programs to diagrams. Using this program, we evaluated our method on all IMO geometry problems since 2000. Our implementation reliably produces models; moreover, known truths that are not directly optimized for typically hold up to tolerance. By handling configurations of this complexity, our system clears a roadblock in GTP and provides a critical tool for those undertaking the IMO Grand Challenge.

# **References**

1. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **The Fusemate Logic Programming System**

Peter Baumgartner

Data61/CSIRO and The Australian National University, Canberra, Australia Peter.Baumgartner@data61.csiro.au

**Abstract.** Fusemate is a logic programming system that implements the possible model semantics for disjunctive logic programs. Its input language is centered around a weak notion of stratification with comprehension and aggregation operators on top of it. Fusemate is implemented as a shallow embedding in the Scala programming language. This enables using Scala data types natively as terms, a tight interface with external systems, and it makes model computation available as an ordinary container data structure constructor. The paper describes the above features and implementation aspects. It also demonstrates them with a non-trivial use-case, the embedding of the description logic ALCIF into Fusemate's input language.

# **1 Introduction**

Fusemate<sup>1</sup> is a logic programming system for computing possible models of disjunctive logic programs [23,24]. A Fusemate logic program consists of (typically) non-ground if-then rules with stratified default negation in the body [21]. Stratification entails that a true default-negated body literal remains true in the course of deriving new conclusions.

Fusemate was introduced in [7] for modelling systems that evolve over time and for analysing their current state based on the events so far. Such tasks are often subsumed under the terms of stream processing, complex event recognition, and situational awareness, and have been addressed (also) with logic-based approaches [2,9,4,5].

To my knowledge, Fusemate is unique among all these and other logic programming systems [12,1,13,26,16] (and theorem provers) in the way it is implemented. Fusemate is implemented by shallow embedding in a full-fledged programming language, Scala [25]. Essentially, the user writes a syntactically sugared Scala program utilizing familiar logic programming notation, and the program's execution returns models. This has advantages and disadvantages. The main disadvantage is that it is more difficult to implement performance-boosting measures like term indexing. The main advantage is that interfacing with data structure libraries and with external systems is easy, an aspect whose importance has been emphasized for virtually all of the above systems. In fact, Fusemate is motivated in part by exploring how far the embedding approach can be pushed and to what benefit.

The earlier Fusemate paper [7] focused on the model computation calculus with a belief revision operator as the main novelty. It utilized a certain notion of *stratification*

© The Author(s) 2021

<sup>1</sup> Fusemate is available at https://bitbucket.csiro.au/users/bau050/repos/fusemate/.

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 589–601, 2021. https://doi.org/10.1007/978-3-030-79876-5_34

*by time (SBT)* for making the calculus effective and useful in the intended application areas. This system description focuses on the advantages of the shallow embedding approach as boosted by new language features introduced here. These new language features are (a) non-standard comprehension and aggregation operators, among others, and (b) a weaker notion of *stratification by time and predicates (SBTP)*. In brief, SBTP is a lexicographic combination of stratification by time and the standard stratification in terms of the call-graph of the program. Section 5 has an example that demonstrates the need for (a) and (b) in combination, and Section 4 discusses the shallow embedding approach and its advantages on a more general level.

Here is an excerpt from a Fusemate program that previews some of the new features:


The scenario comprises traffic lights identified by numbers 1 to 10 (line 2). In the course of time the traffic lights change their colors, and each such event is recorded as a corresponding Change atom (line 3). The rule on line 6 computes a State at a current time Now(time) as a snapshot of the current colors of all traffic lights. For that, the comprehension Change(t <= time, id, color) on line 6 finds the latest Change event before or at time for a fixed id chosen from allIds, and binds that time to the (unused) variable t. A FullState aggregates the separate State facts at a time, partitioned as (Scala) sets of ids of "drive" and "stop" colors. In that, the **COLLECT** special form collects in a Scala List-typed variable the specified terms that satisfy the body behind **STH**. Notice that all atoms in FullState refer to the same time, yet the program is SBTP because State comes before FullState in the predicate stratification. (Predicate stratification is computed automatically by Fusemate with Tarjan's algorithm.) The rule on line 10 demonstrates the use of the Scala Set method size in the body. Line 11 demonstrates the use of default negation in combination with comprehension. When applied to a given sequence of Change events, Fusemate computes models, one at a time, each as a Scala set of atoms.
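The predicate-stratification step mentioned above can be sketched with a plain Tarjan SCC computation over the program's call graph (a Python illustration of the standard algorithm, not Fusemate's Scala code; the TrafficLight predicate name and the edge set are assumptions based on the described excerpt):

```python
def tarjan_scc(graph):
    """Tarjan's algorithm: strongly connected components of a digraph
    given as {node: [successors]}, returned in reverse topological
    order (dependencies before dependents)."""
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

# call graph: an edge p -> q means a rule for p refers to q in its body
calls = {"FullState": ["State"], "State": ["Change", "TrafficLight"],
         "Change": [], "TrafficLight": []}
order = tarjan_scc(calls)   # strata: Change/TrafficLight, State, FullState
```

Each SCC becomes one stratum; the reverse topological order guarantees that State is placed before FullState, which is exactly what makes the same-time rule above SBTP-compliant.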

## **2 Fusemate Programs**

For the purpose of this paper, a brief summary of the syntactic notions underlying Fusemate programs is sufficient; see [7] for details. Terms and atoms of a given signature are defined as usual. Let *var*(e) denote the set of variables occurring in an expression e. We say that e is *ground* if *var*(e) = ∅. We write eσ for applying a substitution σ to e. The domain of σ is denoted by *dom*(σ). A substitution σ is a *grounding substitution for* e iff *dom*(σ) = *var*(e) and eσ is ground. In this case we simply say that σ *is for* e.

Let T be a countably infinite discrete set of *time points* equipped with a total strict ordering < ("earlier than"), e.g., the integers. Assume that the time points, comparison operators = and ≤, and a successor time function +1 are part of the signature and interpreted in the intended way. A *time term* is a (possibly non-ground) term over the sub-signature T ∪ {+1}.

The signature may contain other "built-in" predicate and function symbols for predefined types such as strings, arithmetic data types, sets, etc. We only informally assume that all terms are built in a well-sorted way and that built-in operators over ground terms can be evaluated effectively.

An *ordinary atom (with time term t)* is of the form p(t, t<sub>1</sub>,…,t<sub>n</sub>) where p is an ordinary predicate (i.e., neither a time predicate nor built-in), t is a time term and t<sub>1</sub>,…,t<sub>n</sub> are terms. A *(Fusemate) rule* is an implication written in Prolog-like syntax as

$$H \text{ :- } b_1, \dots, b_k,\ \mathbf{not}\ \vec{b}_{k+1}, \dots, \mathbf{not}\ \vec{b}_n \tag{1}$$

In (1), a rule *head* H is either (a) a disjunction h<sub>1</sub> ∨ ··· ∨ h<sub>m</sub> of ordinary atoms, for some m ≥ 1, or (b) the expression **fail**.<sup>2</sup> In case (a) the rule is *ordinary* and in case (b) it is a *fail rule*. A rule *body*, the part to the right of :-, is defined by mutual recursion as follows. A *positive body literal* is one of the following: (a) an ordinary atom, (b) a *comprehension atom (with time term s)* of the form p(x ◦ s, t<sub>1</sub>,…,t<sub>n</sub>) **sth** B, where x is a variable, ◦ ∈ {<, ≤, >, ≥} and B is a body, (c) a built-in call, i.e., an atom with a built-in predicate symbol, or (d) a *special form* **let**(x, t), **choose**(x, *ts*), **match**(s, t) or **collect**(x, t **sth** B) where x is a variable, s, t are terms, *ts* is a list of terms, and B is a body. A *positive body* is a list b<sub>1</sub>,…,b<sub>k</sub> of positive body literals with k ≥ 0. If k = 0 then the positive body is *empty*, otherwise it is *non-empty*. A *negative body literal* is an expression of the form **not** 𝑏⃗, where 𝑏⃗ is a non-empty positive body. A *body* is a list B = b<sub>1</sub>,…,b<sub>k</sub>, **not** 𝑏⃗<sub>k+1</sub>,…,**not** 𝑏⃗<sub>n</sub> comprised of a (possibly empty) positive body and (possibly zero) negative body literals. It is *variable free* if *var*(b<sub>1</sub>,…,b<sub>k</sub>) = ∅.

Let R be a rule (1). We say that R is *range-restricted* iff *var*(H) ⊆ *var*(b<sub>1</sub>,…,b<sub>k</sub>). Compared to the usual notion of range-restrictedness [18], Fusemate rules may contain extra variables in negative body literals. For example, p(time, x) :- q(time, x), **not**(t < time, r(t, x, y)) is range-restricted in our sense with extra variables t and y. The extra variables are implicitly existentially quantified within the **not** expression. The example corresponds to the formula q(time, x) ∧ ¬∃t, y.(t < time ∧ r(t, x, y)) → p(time, x). Semantically and operationally this will cause no problems thanks to stratification, introduced next.

Fusemate programs – sets of rules – need to be "stratified by time and by predicates" (SBTP). The standard notion of stratification by predicates means that the call graph of the program contains no cycles going through negative body literals. The edges of this call graph are given by the "depends on" relation between predicate symbols, such that p positively (negatively) depends on q if there is a rule with a p-atom in its head and a q-atom in its positive (negative) body. For disjunctive heads, all head predicates are

<sup>2</sup> This definition of head is actually simplified as Fusemate offers an additional head operator for belief revision, see [7]. This is ignored here.

defined to depend positively on each other. Every strongly connected component of the call graph is called a stratum, and in predicate stratified programs negative body literals can occur only in strata lower than the head stratum.
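The stratification check sketched above can be made concrete. The following is an illustrative Python sketch (not Fusemate's Scala implementation; all names are hypothetical): strata are the strongly connected components of the call graph, computed with Tarjan's algorithm as the text mentions, and the program is predicate-stratified iff no negative dependency edge stays inside a single component.

```python
from itertools import chain

def sccs(nodes, edges):
    """Tarjan's algorithm. edges: dict node -> set of successor nodes."""
    index, low, on_stack, stack, comps = {}, {}, set(), [], []
    counter = [0]
    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in edges.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:       # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            comps.append(frozenset(comp))
    for v in nodes:
        if v not in index:
            visit(v)
    return comps

def predicate_stratified(pos, neg):
    """pos, neg: sets of (head_pred, body_pred) dependency edges."""
    nodes = set(chain.from_iterable(pos | neg))
    edges = {}
    for h, b in pos | neg:
        edges.setdefault(h, set()).add(b)
    comp_of = {p: c for c in sccs(nodes, edges) for p in c}
    # A negative edge within a single stratum is a cycle through negation.
    return all(comp_of[h] != comp_of[b] for h, b in neg)

# State depends on Change, FullState on State (both positively): stratified.
print(predicate_stratified({("State", "Change"), ("FullState", "State")}, set()))   # True
# p depends negatively on itself: not predicate-stratified.
print(predicate_stratified(set(), {("p", "p")}))   # False
```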

SBTP is defined as follows: for every rule (1) in a given program, (a) there is a variable *time* that is the time term of some ordinary atom in the positive body, (b) if H is an ordinary head then every head literal must have a time term constrained to be ≥ than *time*, and (c) for all rule bodies occurring in the rule: (i) the time term of every ordinary positive body literal is constrained to be ≤ *time*, and (ii) the time term of every atom inside a negative body literal is constrained to be < *time*, or ≤ *time* if its predicate belongs to a strictly lower stratum than the head's.


For the purpose of this paper we only informally assume that all rules contain constraints for enforcing the required time ordering properties. There are similar stratification requirements for comprehension atoms and special forms so that their evaluation satisfies the counterpart of condition (ii) (see below for **collect**). A fully formal definition could be given by modifying the spelled-out definition of SBT in [7].

As an example, if r belongs to a lower stratum than p then the following five rules all are SBTP, while only the first two rules are SBT.

$$\mathsf{p}(\mathit{time}, x) \text{ :- } \mathsf{q}(\mathit{time}, x),\ \mathsf{r}(t, y),\ t \le \mathit{time} \tag{2}$$

$$\mathsf{p}(\mathit{time}, x) \text{ :- } \mathsf{q}(\mathit{time}, x),\ \mathbf{not}\,(\mathsf{r}(t, y),\ t < \mathit{time}) \tag{3}$$

$$\mathsf{p}(\mathit{time}, x) \text{ :- } \mathsf{q}(\mathit{time}, x),\ \mathbf{not}\,(\mathsf{r}(t, y),\ t \le \mathit{time}) \tag{4}$$

$$\mathsf{p}(\mathit{time}+1, x) \text{ :- } \mathsf{q}(\mathit{time}, x),\ \mathbf{not}\,(\mathsf{r}(t, y),\ t \le \mathit{time}) \tag{5}$$

$$\mathsf{p}(\mathit{time}, x) \text{ :- } \mathsf{q}(\mathit{time}, x),\ (\mathsf{p}(t < \mathit{time}, y)\ \mathbf{sth}\ \mathsf{q}(t, y)),\ \mathsf{r}(t, y) \tag{6}$$

Finally, a *(Fusemate) program* is a set of range-restricted rules that is SBTP.

# **3 Model Computation**

The possible model semantics of disjunctive logic programs [23,24] associates to a given disjunctive program a certain set of normal programs (i.e., without disjunctive heads) and takes the intended model(s) of these normal programs as the possible models of the given program. These "split" programs represent all possible ways of making one or more head literals true, for every disjunctive rule. As a propositional example, the program { a :- , b ∨ c :- a } is associated with the split programs { a :- , b :- a }, { a :- , c :- a }, and { a :- , b :- a, c :- a }. The possible models, hence, are {a, b}, {a, c}, and {a, b, c}.
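The splitting step can be made concrete with a small runnable sketch (Python for illustration; hypothetical helper names, not Fusemate's implementation): each disjunctive rule is replaced by the definite rules for one nonempty subset of its head atoms, and each split program's least model is a possible model.

```python
from itertools import chain, combinations, product

def nonempty_subsets(atoms):
    return chain.from_iterable(combinations(atoms, k) for k in range(1, len(atoms) + 1))

def split_programs(program):
    """program: list of rules (head_atoms, body_atoms) with disjunctive heads.
    Yields normal programs as lists of rules (head_atom, body_atoms)."""
    per_rule = [[[(h, body) for h in subset] for subset in nonempty_subsets(head)]
                for head, body in program]
    for choice in product(*per_rule):
        yield [rule for rules in choice for rule in rules]

def minimal_model(normal_program):
    """Least model of a definite propositional program by forward chaining."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, body in normal_program:
            if set(body) <= model and head not in model:
                model.add(head)
                changed = True
    return frozenset(model)

# Program { a :- .  b v c :- a. }
program = [(("a",), ()), (("b", "c"), ("a",))]
models = {minimal_model(p) for p in split_programs(program)}
print(sorted(sorted(m) for m in models))   # [['a', 'b'], ['a', 'b', 'c'], ['a', 'c']]
```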

Fusemate computes possible models by a bottom-up fixpoint computation, dynamically grounding the program rules in the style of hyper tableaux [8]. The model computation procedure is implemented as a variant of the well-known given-clause algorithm, which seeks to avoid deriving the same conclusion from the same premises twice. It exhausts inferences in an outer loop/inner loop fashion according to the given program's stratification by time and by predicates. The main data structure is a set of paths, where each path represents a partial model candidate computed so far (see [7] for more details). Paths are selected, extended, split and put back into the set until exhausted, for a depth-first, left-to-right inference strategy. Paths carry full status information, which is instrumental for implementing incrementality, such that facts with current or later time can be added at any stage without requiring model recomputation from scratch. This, however, necessitated keeping already exhausted paths for continued inferences later.

The proof procedure's core operation is computing a *body matcher*, i.e., a substitution for a rule's positive body variables so that the rule body becomes satisfied in the current partial model candidate. Formally, let I be a set of ordinary ground atoms, representing the obvious interpretation that assigns true to exactly the members of I. Let B be a body. A *body matcher for* B is a substitution σ for the positive body of B, written as I, σ |= B, such that the following holds (b, B means the sequence of first body literal b and rest body B):

- I, ε |= ∅, where ∅ is the empty body and ε is the empty substitution.
- I, σδ |= b, B iff σ is for b, bσ ∈ I and I, δ |= Bσ, with b an ordinary atom.
- I, σδ |= (p(x ≤ s, t<sub>1</sub>,…,t<sub>n</sub>) **sth** B′), B iff σ is for p(x, t<sub>1</sub>,…,t<sub>n</sub>) and (1) p(x, t<sub>1</sub>,…,t<sub>n</sub>)σ ∈ I, xσ ≤ sσ and I, γ |= B′σ for some γ, (2) there is no ρ for p(x, t<sub>1</sub>,…,t<sub>n</sub>) and no γ such that p(x, t<sub>1</sub>,…,t<sub>n</sub>)ρ ∈ I, xσ < xρ ≤ sρ and I, γ |= B′ρ, and (3) I, δ |= Bσ.
- I, δ |= c, B iff c evaluates to true and I, δ |= B, where c is a ground built-in call.
- I, σδ |= **let**(x, t), B iff σ = [x ↦ t] and I, δ |= Bσ.
- I, σδ |= **choose**(x, *ts*), B iff σ = [x ↦ s] and I, δ |= Bσ for some s ∈ *ts*.
- I, σδ |= **match**(s, t), B iff σ is for s, sσ = t and I, δ |= Bσ.
- I, σδ |= **collect**(x, t **sth** B′), B iff σ = [x ↦ {tγ | I, γ |= B′}] and I, δ |= Bσ.
- I, δ |= **not** 𝑏⃗, B iff there is no γ such that I, γ |= 𝑏⃗, and I, δ |= B.

A *comprehension atom* p(x ◦ s, t<sub>1</sub>,…,t<sub>n</sub>) **sth** B stands for the subset of all ground p-instances in I such that B is satisfied and with a time x as close as possible to s wrt. < or ≤. The cases for > and ≥ are dual and not spelled out above to save space. The **collect** special form collects in the variable x the set of all instances of the term t such that the body B is satisfied in I. We require comprehension atoms and **collect**s to be used in a stratified way, so that their results do not change later in a derivation when I is extended. The requirements are the same as with **not** and can be enforced by ordering constraints.
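What a comprehension such as Change(t <= time, id, color) **sth** … computes can be illustrated with a few lines of Python (hypothetical data representation, not Fusemate code): among the ground facts for a fixed id, pick the one whose time is as close as possible to the bound from below.

```python
def latest_at_or_before(facts, pred, ident, time):
    """facts: set of tuples (pred, t, id, value). Returns the fact for `ident`
    with the greatest time t such that t <= time, or None if there is none."""
    candidates = [f for f in facts if f[0] == pred and f[2] == ident and f[1] <= time]
    return max(candidates, key=lambda f: f[1], default=None)

changes = {("Change", 1, 7, "green"), ("Change", 4, 7, "red"), ("Change", 9, 7, "green")}
print(latest_at_or_before(changes, "Change", 7, 5))   # ('Change', 4, 7, 'red')
print(latest_at_or_before(changes, "Change", 7, 0))   # None
```

The stratification requirement corresponds to the observation that this result is stable once no facts with time ≤ 5 can be added anymore.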

The definition above extends the earlier definition of body matchers in [7] with the new comprehension construct and the **let**, **choose**, **match** and **collect** operators. It now also enforces left-to-right evaluation of bodies, because the new binding operators depend on a fixed evaluation order to be useful. An example is the (nonsensical) body **CHOOSE**(x: Int, List(1, 2, 3)), **LET**(xxx: Int, 3*x), xxx % 2 == 0, which relies on this order. Undefined cases, e.g., when evaluation of a non-ground built-in is attempted, or when a binder variable has already been used before, are detected as compile-time syntax errors.
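The left-to-right evaluation of binding operators can be sketched as a tiny interpreter (Python for illustration; the encoding of literals is an assumption, not Fusemate's representation). It replays the CHOOSE/LET/guard example from the text:

```python
def eval_body(literals, binding=None):
    """literals: list of ('choose', var, values) | ('let', var, fn) | ('guard', fn).
    Yields every variable binding that satisfies the body, left to right."""
    binding = dict(binding or {})
    if not literals:
        yield binding
        return
    op, *args = literals[0]
    rest = literals[1:]
    if op == "choose":                    # CHOOSE: enumerate candidate values
        var, values = args
        for v in values:
            yield from eval_body(rest, {**binding, var: v})
    elif op == "let":                     # LET: compute a derived binding
        var, fn = args
        yield from eval_body(rest, {**binding, var: fn(binding)})
    elif op == "guard":                   # built-in test: filter bindings
        (fn,) = args
        if fn(binding):
            yield from eval_body(rest, binding)

# CHOOSE(x, List(1, 2, 3)), LET(xxx, 3*x), xxx % 2 == 0
body = [("choose", "x", [1, 2, 3]),
        ("let", "xxx", lambda b: 3 * b["x"]),
        ("guard", lambda b: b["xxx"] % 2 == 0)]
print(list(eval_body(body)))   # [{'x': 2, 'xxx': 6}]
```

Evaluating the guard before the LET would fail on the unbound xxx, which is why a fixed order matters.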

## **4 Shallow Embedding in Scala**

Fusemate is implemented as a shallow embedding into Scala [25]. It has three conceptual main components: a signature framework, a Scala compiler plugin, and an inference engine for fixpoint computation as explained in Section 3. The signature framework provides a set of Scala class definitions as the syntactical basis for writing Fusemate programs. It is parameterized in a type Time, which can be any Scala or Java type that is equipped with an ordering and an addition function for time increments, for example Int or java.time.OffsetDateTime. The programmer then refines an abstract class Atom of the Time-instantiated signature framework with definitions of predicate symbols and their (Scala-)sorted arities. See lines (3)–(5) in the program in the introduction for an example. These atoms can then be used in Fusemate rules, see lines (6)–(12) in the example.

While written in convenient syntax, rules are syntactically ill-formed Scala. This problem is solved by the compiler plugin, which intercepts the compilation of the input file at an early stage and transforms the rules into valid Scala source code.<sup>3</sup> More precisely, a rule is transformed into a curried partial function that is parameterized in an interpretation context I. The curried parameters are Scala guarded pattern matching expressions and correspond to the rule's positive body literals, in order. For example, the Faulty rule on lines (11) and (12), with the condition since < time ignored for simplicity, is (roughly) translated into the function


```
4 Faulty(time, id, since) } }
```
Notice the renaming of repeated occurrences of the id variable, which is needed for the correct semantics. Notice also that a Scala Boolean-valued expression in an ordinary body literal position (e.g., t < time) simply becomes a guard in a pattern.

The code above can be understood with body matcher computation in mind. Suppose the inference engine selects an interpretation I from the current set of paths. For exhausting a translated rule f on I, the inference engine combinatorially chooses literals a<sub>1</sub>, a<sub>2</sub> ∈ I and collects the evaluation results of f(I)(a<sub>1</sub>)(a<sub>2</sub>), if defined. Observe that by the transformation into Scala pattern matching, body matchers are only implicitly computed by the Scala runtime system. Each evaluation result, hence, is a body-matcher instantiated head.

The rule's negative body literal is translated into the code on line (3) and conjoined to the guard of the preceding ordinary literal. In general, a negative literal **NOT** *body* is treated by translating the rule **FAIL** :- *body* and evaluating the resulting Scala code on I by means of the failsOn method. If **FAIL** is not derivable then **NOT** *body* is satisfied. Again, appropriate bindings for the variables bound outside of *body* are held implicitly by the Scala runtime system. The translation of the special forms and comprehension is not explained here for space reasons. Fusemate can show the generated code, though.

<sup>3</sup> Early experiments showed it is cumbersome and error-prone to write the Scala code by hand, so this was not an option. The compiler plugin is written in Scala and operates at the abstract syntax tree level. This was conveniently done thanks to a sophisticated quasiquote mechanism.

#### **Properties and Advantages**

The shallow embedding approach enables introspection capabilities and interfacing between the rule language and the host language beyond what is implemented in other systems. In Fusemate, the terms of the logical language are nothing but Scala objects. As a consequence, any available Scala type or library data structure can be used as a built-in without extending an "interface" to an extension language – simply because there is none. Dually, the embedding of the rule language into the host language Scala is equally trivial because rules, atoms and interpretations are Scala objects, too.

It is this "closed loop" that makes an aggregation operator (**collect**) possible that returns a list of Scala objects as specified by the programmer, e.g., a list of terms or atoms.<sup>4</sup> This list can be further analysed or manipulated by the rules. See the description logic embedding in Section 5, which critically depends on this feature. This introspection capability stands out in comparison to the logic programming systems mentioned in the introduction. For instance, aggregation in systems like DLV [1] and IDP [12] is limited to predefined integer-valued aggregates for sum, count, times, max and min.

Most logic programming systems can be called from a (traditional) host programming language and can call external systems or utilize libraries for data structures. The DLV system, for instance, interfaces with C++ and Python [22], Prova [16] with Java, and IDP with the Lua scripting language. Systems based on grounding (e.g., DLV and IDP) face the problem of "value invention" by external calls, i.e., having to deal with terms that are not part of the input specification [10].

The main issue, however, from the Fusemate perspective is that these systems' external interfaces are rather heavy-handed (boilerplate code, mapping logic terms to/from the host language, String representation of logic programs) and/or limited to a predefined set of data structures. In contrast, Fusemate's seamless integration with Scala encourages a more integrated and experimental problem solving workflow. The following Scala program demonstrates this point with the traffic light example:


From a workflow perspective, this program integrates Fusemate as a list operator (on a list of Change instances) in an otherwise unremarkable functional program.

<sup>4</sup> Technically, this is possible because the current interpretation is available in the rule body through the parameter I (see the transformation example above). One could directly access I, e.g., as in **CHOOSE**(a: Atom, I), **MATCH**(State(t, 3, c), a), t > 10, c != "red"

For a more realistically sized experiment I tried a combined Fusemate/Scala workflow for analysing the data of the DEBS 2015 Grand Challenge.<sup>5</sup> The data comprises two million taxi rides in New York City in terms of start/end times and start/end GPS coordinates, among others. The problem considered was to detect anomalies where a taxi driver drives away from a busy hotspot without a passenger. Solving the problem required clustering locations by pickup/drop-off activity for determining hotspots, and then analysing driver behavior given their pickups/drop-offs at these hotspots.

Two million data points were too much for Fusemate alone and required Scala preprocessing, e.g., for filling a grid abstraction of New York coordinates, data cleansing, and filtering out drivers with little activity. Fusemate was used for computing clusters with rules similar to transitive closure computation. Inputs to the Fusemate calls were point clouds precomputed in Scala. The computed clusters were used to analyze Scala-prefiltered taxi rides for anomaly detection based on the clusters. This involved three moderately complex rules, first for identifying gaps and then for analysing them. The comprehension operator was useful to find "the most recent ride predating a given start", among others. The longest Fusemate run was 0.31 sec for 64 rides (with 39 clusters fixed); most other runs took less than 0.15 sec. Fusemate's performance was perfectly acceptable in this experiment thanks to a *combined* workflow.

# **5 Embedding Description Logic** ALCIF

ALCIF is the well-known description logic ALC extended with inverse roles and functional roles. (See [3] for background on description logics.) This section describes how to translate an ALCIF knowledge base to Fusemate rules and facts for satisfiability checking.

This is our example knowledge base, TBox on the left, ABox on the right:


The father role is declared as functional, i.e., as a right-unique relation, and father<sup>−1</sup> denotes its inverse "child" relation. The third GCI says that all children of a rich father are rich as well. In all models of the knowledge base Fred is Poor. This follows from the given fact that his child Anne is poor, the functionality of father, and the third GCI. However, there are models where Bob is Rich and models where Bob is Poor.

Translating description logic into rule-based languages has been done in many ways, see e.g. [20,17,14,11]. An obvious starting point is taking the FOL version of a given knowledge base. Concept names become unary predicates, role names become binary predicates, and GCIs (general concept inclusions) are translated into implications. By polynomial transformations, the implications can be turned into clausal form (if-then rules over literals), except for existential quantification in a positive context, which

<sup>5</sup> http://www.debs2015.org/call-grand-challenge.html

causes unbounded Skolem terms in derivations when treated naively (for example, the third GCI above is problematic in this sense). This is why many systems, and also the transformation to Fusemate below, avoid Skolemization.

The first GCI corresponds to the clause Person(x) → Rich(x) ∨ Poor(x), and the second corresponds to the "almost" clause Person(x) → ∃y.(father(x, y) ∧ Person(y)). Fusemate works with the reified rule versions of these, with an IsA-predicate for concept instances, and a HasA-predicate for role instances. For the whole TBox one obtains the following, where RN stands for "role name" and CN stands for "concept name".<sup>6</sup>


Every GCI can be converted into rules like the above without problems. For that, starting from its NNF, ∃-quantifications in the premise of a rule can be expanded in place, and ∀-quantifications can be moved to the head as the ∃-quantification of the NNF of the negated formula. Similarly for negated concept names. See [20] for such transformation methods. The ABox is represented similarly. Its first element, for instance, is IsA(Name("Anne"), And2( CN("Person"), CN("Poor")), 0).
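The NNF step assumed above can be sketched in a few lines (Python for illustration; the tuple encoding of ALC concepts is an assumption, not the paper's representation): negation is pushed inward over the connectives and quantifiers until it rests on concept names only.

```python
def nnf(c):
    """Concepts: ('cn', name) | ('not', c) | ('and', c, d) | ('or', c, d) |
    ('exists', role, c) | ('forall', role, c). Returns the NNF of c."""
    op = c[0]
    if op != "not":
        if op == "cn":
            return c
        if op in ("and", "or"):
            return (op, nnf(c[1]), nnf(c[2]))
        return (op, c[1], nnf(c[2]))                      # exists / forall
    inner = c[1]
    if inner[0] == "cn":                                   # negated concept name: done
        return c
    if inner[0] == "not":                                  # double negation
        return nnf(inner[1])
    if inner[0] == "and":                                  # de Morgan
        return ("or", nnf(("not", inner[1])), nnf(("not", inner[2])))
    if inner[0] == "or":
        return ("and", nnf(("not", inner[1])), nnf(("not", inner[2])))
    if inner[0] == "exists":                               # quantifier duality
        return ("forall", inner[1], nnf(("not", inner[2])))
    return ("exists", inner[1], nnf(("not", inner[2])))    # not forall

rich = ("cn", "Rich")
print(nnf(("not", ("exists", "father", ("not", rich)))))
# ('forall', 'father', ('cn', 'Rich'))
```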

In addition, some more general "library" rules for the tableau calculus are needed:


```
10 HasA(x, r, rSuccOfx, time+1) AND IsA(rSuccOfx, c, time+1): @preds("TimePlus1") :- (
```

<sup>15</sup> IsA(y, c, time) :-


The expansion rules on lines 1 and 2 deal with the ALC binary Boolean connectives And2 and Or2 in the obvious way. Supposing NNF of embedded formulas, no other cases can apply. The remaining rules can be understood best with the standard tableau algorithm for ALCIF in mind, which includes blocking to guarantee termination. They follow the terminology in [6, Chapter 4]. The Neighbour relation abstracts from the HasA relation and is omitted for space reasons. The expansion rule for ∃ comes in three cases. The first case (line 5), for example, applies to non-functional roles as per the Scala built-in test on line 6. The expansion of the given ∃-formula only happens if it

<sup>6</sup> See the Fusemate web page for the full, runnable code.

is not yet satisfied and in a non-blocked situation (line 7). In this case the rule derives a Skolem object defined on line 8 for satisfying the ∃-formula. Notice the annotation **@**preds("TimePlus1"), which makes sure that the head is on the highest stratum. This way, the rule will be applied after, in particular, the rules for blocking. Furthermore, with the time stamp time + 1 the Skolem object is kept separate from the computations in the current iteration time. The blocking rules are defined as follows:


Some additional rules are needed for dealing with basic inconsistencies and for carrying over IsA and HasA facts between iterations. They are not shown here.

The expansion rules and blocking rules follow the tableau calculus description in [6, Chapter 4]. One important detail is that the expansion rule for ∃ must be applied with lowest priority. This is straightforward thanks to Fusemate's stratification and aggregation construct. Equally important is the access to (Scala) data structures via built-ins and using them as terms of the logical language. This made it easy to program Skolemization and the Label relation for collecting sets of concepts of an individual.

# **6 Conclusions**

This paper described recent developments around the Fusemate logic programming system. It included new technical improvements for a weaker form of stratification, which enabled useful aggregation and comprehension language constructs. It also argued for the advantages of the tight integration with Fusemate's host language, Scala, in terms of data structures and usability.

Answer set solvers like DLV and Smodels are designed to solve NP-complete or higher complexity search problems as fast as possible. Fusemate is not motivated as a competitor to such systems; it is motivated by "well-behaved" knowledge representation applications, similarly to description logic reasoners, whose (often) NExpTime-complete solving capabilities are not expected to be typically needed. (Some more work is needed, though, e.g., on improving the current term indexing techniques to speed up model computation.) More specifically, the main intended application of Fusemate is the runtime analysis of systems that evolve over time. The taxi rides data experiment explained in Section 4 is an example of that. It suggests that Fusemate is currently best used in a combined problem solving workflow if scalability is an issue.

As for future work, the next steps are to make the description logic reasoner of Section 5 callable from within Fusemate rules in a DL-safe way [19] and to embed a temporal reasoning formalism. The event calculus [15] seems to be a good fit.

Acknowledgements. I am grateful to the reviewers for their helpful comments.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Twee: An Equational Theorem Prover**

Nicholas Smallbone

Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden nicsma@chalmers.se

**Abstract.** Twee is an automated theorem prover for equational logic. It implements unfailing Knuth-Bendix completion with ground joinability testing and a connectedness-based redundancy criterion. It came second in the UEQ division of CASC-J10, solving some problems that no other system solved. This paper describes Twee's design and implementation.

**Keywords:** Automated theorem proving · unit equality · completion

## **1 Introduction**

Twee is an automated theorem prover for equational logic, available as opensource software [17]. It features good performance (coming second in the UEQ division of CASC-J10), low memory use, and human-readable proof output.

Twee's general architecture is quite traditional: it uses a DISCOUNT loop [7] implementing unfailing Knuth-Bendix completion [3]. However, it has a few characteristics which are unusual in a high-performance theorem prover:

*Fixed heuristics.* Twee does not adjust its strategy based on the input problem. It uses a fixed term order, a fixed critical pair scoring function, and so on. Rather than detecting the kind of problem, Twee uses general-purpose strategies that work for all sorts of problems (Section 2).

*Strong redundancy tests.* Rather than using special strategies for associativecommutative functions, Twee builds in strong redundancy tests, based on ground joinability and connectedness (Section 3). These handle not just AC functions but many kinds of unorientable equations, in particular permutative ones (where both sides are almost the same but with variables in a different order).

*A high-level language.* Twee consists of 5300 lines of Haskell code, whereas for example Waldmeister [12] is 65000 lines of C. As such, it is easy to experiment with. Despite the choice of programming language, Twee is quite fast at raw deduction steps, thanks to careful coding of low-level term operations (Section 4).

Despite the fixed heuristics and high-level language, Twee comes close in performance to E [14] and Waldmeister [12]. It is strong in many problem classes,

© The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 602–613, 2021. https://doi.org/10.1007/978-3-030-79876-5_35

including LAT (lattices) and REL (relation algebra) from TPTP, which feature many commutative operators where Twee's redundancy tests shine, and on unusual problems, where no prover has special heuristics. Twee is however poor at RNG (rings), where it seems important to choose a good term order. The rest of the paper describes Twee's design in detail, focusing on the three aspects above.

*Notation.* We use t ≡ u to mean that t and u are syntactically equal.

#### **2 Architecture**

Twee natively supports only unit equality problems with ground goals, but the frontend also supports arbitrary quantification, Horn formulas, and many-sorted logic. These features are eliminated using the external tool Jukebox [16], which:


At this point, the goal can still contain existentially-quantified variables, which must be eliminated. To do so, we use an old trick, also used by Waldmeister: if the goal t = u is non-ground, we add new function symbols *eq*, *true* and *false*, and two axioms ∀X. *eq*(X, X) = *true* and *eq*(t, u) = *false*, and replace the goal with *true* = *false*. Now we have a unit equality problem with a ground goal.

The main proof loop is shown in Algorithm 1. It implements unfailing completion [3] using a DISCOUNT loop [7]. The state consists of R, a set of rewrite rules and unorientable equations (the *active set*, initially empty); Q, the set of unprocessed critical pairs formed from R (the *passive set*, initially containing all the axioms); J, a set of ground joinable equations used for subsumption checking (following [1]); and the goal. The main loop removes the best critical pair from Q (see below), and if it is not redundant, adds it to R (oriented if possible) and adds all its critical pairs to Q. Every so often, the rules in R are reduced with respect to one another and redundant rules are removed. The goal is kept normalised with respect to R and the prover succeeds if the goal becomes trivial.<sup>1</sup>

The passive set is normally quadratic in the size of the active set: typical numbers are |R| ≈ 10,000 and |Q| ≈ 10,000,000. Hence we must process each passive critical pair at high speed, but can spend time on each new rewrite rule.

*Term ordering.* We always use KBO, with all functions having weight 1, and ordered so that more frequently occurring functions are smaller.
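This fixed KBO instance can be sketched as follows (Python for illustration, not Twee's Haskell; simplified in that the special case for terms headed by a maximal-precedence unary symbol is omitted). Every symbol and variable weighs 1, so the weight of a term is its symbol count, and ties are broken by precedence and then lexicographically:

```python
def weight(t):
    """Terms are ('var', name) or (f, arg, ...); every symbol weighs 1."""
    if t[0] == "var":
        return 1
    return 1 + sum(weight(a) for a in t[1:])

def var_count(t, counts=None):
    counts = counts if counts is not None else {}
    if t[0] == "var":
        counts[t[1]] = counts.get(t[1], 0) + 1
    else:
        for a in t[1:]:
            var_count(a, counts)
    return counts

def kbo_greater(t, u, prec):
    """prec: dict symbol -> int; a higher number means greater precedence."""
    tv, uv = var_count(t), var_count(u)
    if any(tv.get(v, 0) < n for v, n in uv.items()):
        return False                       # variable condition fails
    wt, wu = weight(t), weight(u)
    if wt != wu:
        return wt > wu
    if t[0] == "var" or u[0] == "var":
        return False                       # simplification: unary-max case omitted
    if t[0] != u[0]:
        return prec[t[0]] > prec[u[0]]
    for a, b in zip(t[1:], u[1:]):         # same head: compare lexicographically
        if a != b:
            return kbo_greater(a, b, prec)
    return False

x, y = ("var", "x"), ("var", "y")
prec = {"f": 2, "g": 1}
print(kbo_greater(("f", ("g", x), y), ("g", x), prec))   # True: weight 4 > 2
print(kbo_greater(("f", x, y), ("f", y, x), prec))       # False: commutativity
print(kbo_greater(("f", y, x), ("f", x, y), prec))       # False: unorientable
```

The last two calls show why permutative equations stay unorientable, which is what the redundancy criteria of Section 3 address.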

*Critical pair selection.* When a critical pair is added to Q, it is first normalised and then assigned a score; the proof loop selects the critical pair with the lowest score. The score function's job is to pick out promising critical pairs, and the choice of score function can make or break the prover. However, as it is applied to every critical pair, it also needs to be fast. We compute scores as follows:

<sup>1</sup> An equation is considered trivial if it is of the form t = t.

#### **Algorithm 1** The main proof loop

```
(R, J, Q) := (∅, ∅, A)
while Q ≠ ∅ do
    P := remove lowest-scoring element of Q
    if P's parent rules are still present in R then
        normalise P using R to get t = u
        if t ≢ u and t = u is not connected and t = u is not subsumed by J then
            if t = u is ground joinable then
                add t = u to J
            else
                orient t = u and add it to R
                for all critical pairs cp of t = u and R do
                    normalise cp using only the oriented rules in R
                    if cp is non-trivial then add cp to Q end if
                end for
                normalise goal using R
                if goal is trivial then return "theorem" end if
                simplify rules in R wrt each other, but limit this step to 5% of total runtime
            end if
        end if
    end if
end while
return "countersatisfiable"
```


*Proof production and checking.* Twee uses an LCF-style kernel [9] to guarantee soundness. Every member of the active set comes with a proof object, which is verified by a trusted proof checker (consisting of about a page of code). The proofs are low-level and thus easy to check: the only proof steps allowed are reflexivity, symmetry, transitivity, congruence and applying an axiom or lemma. It is not possible to add a rule to the active set without supplying a proof, and

any invalid proof step causes a fatal runtime error. The key to making this fast is that only the active set, not the passive set, includes proof objects.

Once the goal is proved, we transform the proof object into a human-readable proof, consisting of a flat sequence of rewrite steps. We also introduce lemmas, to avoid exponentially-sized proofs: any active rewrite rule is a candidate lemma. Our approach is similar to [8], but simpler, as our proof steps are smaller; however, their lemma selection strategy is smarter than ours and produces fewer lemmas.

*Goal transformation.* Twee's frontend can optionally transform the problem to make the prover more goal-directed. The transformation is simple, but strange. For every function term f(...) appearing in the goal, we introduce a fresh constant symbol a and add the axiom f(...) = a. For example, if the goal is f(g(a), b) = h(c), we add the axioms f(g(a), b) = d1, g(a) = d2, and h(c) = d3. Simplification will rewrite the first axiom to f(d2, b) = d1 and the goal to d1 = d3.

By doing this transformation, (1) any subterm of the goal gets normalised to a constant, so critical pairs containing goal terms get a lower score, and (2) new critical pairs involving these constants appear, which are likely to be relevant to the goal. We evaluate this transformation in Section 5.
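The transformation can be sketched as a short traversal (Python for illustration; a hypothetical helper, not Twee's implementation): walk over the goal's function subterms, outermost first, and emit one definitional axiom with a fresh constant per subterm.

```python
def goal_axioms(goal_terms):
    """goal_terms: list of terms ('var', x) or (f, args...). Returns the
    definitional axioms as pairs (subterm, fresh_constant), outermost first."""
    axioms = []
    counter = [0]
    def walk(t):
        if t[0] == "var" or len(t) == 1:    # variables and constants: skip
            return
        counter[0] += 1
        axioms.append((t, "d%d" % counter[0]))
        for arg in t[1:]:
            walk(arg)
    for t in goal_terms:
        walk(t)
    return axioms

# Goal f(g(a), b) = h(c): axioms f(g(a), b) = d1, g(a) = d2, h(c) = d3.
axioms = goal_axioms([("f", ("g", ("a",)), ("b",)), ("h", ("c",))])
for term, const in axioms:
    print(term, "=", const)
```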

*Weak rewrite rules.* Completion sometimes deduces equations where both sides have a variable not occurring on the other side, such as f(x, y) = g(x, z). Such equations are awkward for rewriting: suppose we want to use this equation to rewrite the term f(t, u): what value should we choose for z?

Twee splits this equation into nicely-behaved rewrite rules instead. To do so, we introduce the concept of a *weak rewrite rule*. A weak rewrite rule t ⇝ u is like an ordinary rewrite rule, except that it only satisfies t ≥ u, not t > u.<sup>2</sup> Weak rewrite rules form critical pairs and participate in rewriting just like any other rewrite rule, except that to ensure termination, we may only perform the rewrite step tσ ⇝ uσ if tσ ≢ uσ, i.e. tσ and uσ are syntactically different terms.<sup>3</sup>

Using weak rewrite rules, Twee splits f(x, y) = g(x, z) into the two rules f(x, y) → g(x, ⊥) and g(x, z) ⇝ g(x, ⊥), where ⊥ is the *minimal* term in the term ordering. Note that g(x, z) ⇝ g(x, ⊥) is a valid weak rewrite rule because g(x, z) ≥ g(x, ⊥), with equality exactly when z = ⊥.

As another example, the equation f(x, x, y, z) = g(x, y, y, w) is split into f(x, x, y, ⊥) = g(x, y, y, ⊥), f(x, x, y, z) ⇝ f(x, x, y, ⊥) and g(x, y, y, w) ⇝ g(x, y, y, ⊥). In this case, we are still left with an unorientable rule afterwards, but since it has the same variables on both sides it is unproblematic for rewriting.
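One simple variant of this split can be sketched as follows (Twee's actual split is slightly more economical; the types and names here are illustrative only). Each variable occurring on only one side is replaced by ⊥, giving a core equation plus one weak rule per side:

```haskell
import qualified Data.Set as Set

-- Illustrative term type, not Twee's actual representation.
data Term = Var String | App String [Term]
  deriving (Eq, Ord, Show)

-- The minimal term in the ordering, here a nullary constant.
bot :: Term
bot = App "bot" []

vars :: Term -> Set.Set String
vars (Var x)    = Set.singleton x
vars (App _ ts) = Set.unions (map vars ts)

-- Replace every variable not in `keep` by the minimal term.
groundExtra :: Set.Set String -> Term -> Term
groundExtra keep (Var x)
  | x `Set.member` keep = Var x
  | otherwise           = bot
groundExtra keep (App f ts) = App f (map (groundExtra keep) ts)

-- Split l = r into a core equation over the shared variables plus
-- weak rules t ~> t' for each side whose extra variables were erased.
split :: Term -> Term -> ((Term, Term), [(Term, Term)])
split l r = ((l', r'), weak)
  where
    shared = vars l `Set.intersection` vars r
    l' = groundExtra shared l
    r' = groundExtra shared r
    weak = [(t, t') | (t, t') <- [(l, l'), (r, r')], t /= t']
```

For f(x, y) = g(x, z) this yields the core f(x, ⊥) = g(x, ⊥) together with the weak rules f(x, y) ⇝ f(x, ⊥) and g(x, z) ⇝ g(x, ⊥), an equivalent set to the two rules given in the text.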

It is always possible and safe to split an equation into an equivalent set of:


Twee does this whenever an equation is about to be added to R.

<sup>2</sup> t ≥ u means: for all grounding substitutions σ, either tσ > uσ or tσ = uσ.
<sup>3</sup> This is different from e.g. constrained rewriting: we can perform the rewrite even if t and u are unifiable, as long as they are not the same term right now.

#### **3 Redundancy Criteria**

The basic redundancy criterion of Knuth-Bendix completion is *joinability*: a critical pair can be discarded if both sides normalise to the same term. Joinability runs into problems when we have unorientable equations. For example, consider a rewrite system for an associative-commutative operator "+":

$$x + y = y + x \tag{1}$$

$$(x + y) + z \to x + (y + z) \tag{2}$$

$$x + (y + z) = y + (x + z)\tag{3}$$

From (1) and (2) we get the critical pair x + (y + z) ←(2) (x + y) + z →(1) z + (x + y), which cannot be rewritten any further, so it is not joinable. However, the critical pair is redundant, because the above rewrite system is ground confluent. We would like to detect redundant but non-joinable critical pairs.

This section presents the redundancy criteria that Twee uses to handle unorientable equations: our take on the well-known approach of ground joinability testing [6], and a novel (we believe) approach based on connectedness [2]. Unlike the standard techniques for associative-commutative functions [1], our criteria handle any kind of permutative equation; we evaluate our approach in Section 5.

#### **3.1 Ground Joinability Testing**

Although the critical pair x + (y + z) ← (x + y) + z → z + (x + y) is not joinable, all ground instances of it are joinable, and we say that the critical pair is ground joinable. For example, the instance a + (b + c) ← (a + b) + c → c + (a + b), with a < b < c, can be joined since c + (a + b) →(3) a + (c + b) →(1) a + (b + c). Any ground joinable critical pair is redundant.

Martin and Nipkow [13] suggest an approach for checking ground joinability:


Their algorithm effectively does a case analysis on all possible variable orderings, but it is inefficient because there are so many possible orderings.

Our algorithm is similar, but tries to minimise the number of cases it considers. It does so by allowing orderings that: (1) constrain only a subset of the variables, such as x < y, and (2) use ≤, as in x ≤ y < z. It works as follows:


*Example.* Take the critical pair x + (y + z) ← (x + y) + z → z + (x + y) and suppose that we choose the ordering x < y < z. It can be joined when this order holds, as for any instance where x < y < z, we have z + (x + y) →(3) x + (z + y) →(1) x + (y + z).

Having joined the critical pair in one case, we now generalise the case. We first try to remove each variable in turn, i.e. to join the critical pair in the three cases x < y, y < z, and x < z in turn. None of these attempts succeeds.

Now we try replacing a < with a ≤, to get x < y ≤ z. We must check if all ground instances satisfying x < y ≤ z are joinable, but how? We might think of splitting this into two cases, x < y < z and x < y = z, but instead we are going to find *one rewrite proof* that works for both.

Consider the rewrite proof above. In it, the step x + (z + y) → x + (y + z) is fine if y < z, but does not seem to be allowed if y = z. But in fact it is fine: if y = z, the terms x + (z + y) and x + (y + z) are identical, so this rewrite step does nothing and can just be dropped. That is, the proof works both when x < y < z and x < y = z, and shows joinability for the case x < y ≤ z. We generalise the other < similarly, showing that the critical pair is joinable in the case x ≤ y ≤ z.

Next, we pick another total order on the variables, but not one in which x ≤ y ≤ z. We might pick, for example, z < y < x. The process repeats: we show ground joinability under this ordering, and generalise it to z ≤ y ≤ x. We repeat until all cases are covered, and the ground joinability test succeeds.

Although our algorithm can be expensive in theory, in practice it needs to consider only a few orderings, and a small number of variables. Step (5) can occasionally be expensive, but by generalising < to ≤ we can usually avoid it.

*The general case.* Here is how we test joinability under a given variable ordering. First, we parameterise our term order. Given an ordering C, we define t ≥<sub>C</sub> u to mean that, for all grounding substitutions σ, if σ satisfies C then tσ ≥ uσ.

In the example, we weakened a < to a ≤. To do so, we used a rewrite step that, in some ground instances, rewrote a term to *the same term*. To allow these kinds of steps, we loosen our definition of rewriting: we may perform a rewrite t → u under C as long as t ≥<sub>C</sub> u and t ≢ u. Rewriting terminates because given a rewrite proof t ≥<sub>C</sub> u ≥<sub>C</sub> v ≥<sub>C</sub> ..., there is always a ground instance where t′ > u′ > v′ > ..., since C was constructed as a strict order in step (1).

With this definition, normalising z + (x + y) using the ordering C := x ≤ y ≤ z yields z + (x + y) → x + (z + y) → x + (y + z), where e.g. the first step is allowed because z + x ≥<sub>C</sub> x + z and z + x ≢ x + z. Thus we can join our example critical pair under a given variable ordering just by normalising both sides, as we want.

The last ingredient is to implement a test for t ≥<sub>C</sub> u, which we have done for KBO. The tricky part is checking whether weight(t) ≥ weight(u), which can be solved by taking the expression weight(t) − weight(u), a linear combination of the weights of t's and u's variables, and computing its minimum possible value.
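The weight test can be sketched as follows, under simplifying assumptions of our own: we ignore the constraint C, give every function symbol weight 1, and require every variable's ground instance to weigh at least `wmin`. None of this is Twee's actual code; it only illustrates minimising the linear combination weight(t) − weight(u):

```haskell
import qualified Data.Map as Map

-- Illustrative term type, not Twee's actual representation.
data Term = Var String | App String [Term]
  deriving (Eq, Show)

-- Occurrence count of each variable.
varCounts :: Term -> Map.Map String Int
varCounts (Var x)    = Map.singleton x 1
varCounts (App _ ts) = Map.unionsWith (+) (map varCounts ts)

-- Weight of the function symbols only; here every symbol weighs 1.
symWeight :: Term -> Int
symWeight (Var _)    = 0
symWeight (App _ ts) = 1 + sum (map symWeight ts)

-- Minimum possible value of weight(t) - weight(u) over ground
-- instances where each variable's instance weighs at least wmin.
-- Nothing means unbounded below (some variable occurs more often
-- in u than in t, so its instance can be made arbitrarily heavy).
minWeightDiff :: Int -> Term -> Term -> Maybe Int
minWeightDiff wmin t u
  | any (< 0) (Map.elems coeffs) = Nothing
  | otherwise = Just (constPart + wmin * sum (Map.elems coeffs))
  where
    coeffs    = Map.unionWith (+) (varCounts t)
                                  (Map.map negate (varCounts u))
    constPart = symWeight t - symWeight u
```

A result of `Just n` with n ≥ 0 certifies weight(t) ≥ weight(u) for all ground instances; handling the constraint C (e.g. x ≤ y ties the variables' weights together) needs more work, as the text indicates.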

One nice property is that the rest of the ground joining code is independent of the term order. To support e.g. LPO, one just needs to implement ≥<sub>C</sub> for it.

*Why not allow arbitrary ordering constraints?* Some critical pairs can only be ground joined by using ordering constraints on arbitrary terms (e.g. x + y < z). We do not support these, as they make everything enormously more complex:


#### **3.2 Connectedness**

Ground joinability testing is rather heavyweight, constructing and analysing a sometimes large case split, and sometimes it fails because it only supports case splits on variables. Twee also supports a simpler, complementary method that works well when an unorientable equation is applied *under* another function.

The method makes use of *connectedness*. A critical pair s ← t → u is *connected* if there is a rewrite proof s = t<sub>1</sub> = ... = t<sub>n</sub> = u such that each t<sub>i</sub> is strictly less than t [2]. In Knuth-Bendix completion, any connected critical pair is redundant. In other words, when joining s ← t → u, we can do rewrite steps that *increase* the term, as long as the result is always strictly less than t.

Here is how we use connectedness. Let σ be a substitution that grounds s and u. When joining s ← t → u, we may want to perform a rewrite step v → w using an unoriented equation, but we don't know if v ≥ w. We allow the rewrite step v → w as long as: (1) w < t, and (2) vσ > wσ. Condition (1) ensures connectedness, and condition (2) ensures that rewriting eventually terminates.

For example, suppose we take the earlier rules for "+" and add a function f:

$$f(x+y, z+w) \to f(x, f(z, f(y, w)))\tag{4}$$

$$f(x, f(y, z)) = f(y, f(x, z))\tag{5}$$

Assume KBO with both f and + having weight 1. One critical pair is f(y, f(z, f(x, w))) ←(4) f(y + x, z + w) ←(1) f(x + y, z + w) →(4) f(x, f(z, f(y, w))). We can show this to be connected using σ = {x ↦ a, y ↦ b, z ↦ c, w ↦ d}, with a < b < c < d. The left term f(y, f(z, f(x, w))) rewrites to f(y, f(x, f(z, w))) using (5), because f(y, f(x, f(z, w))) < f(x + y, z + w) (connectedness) and f(b, f(c, f(a, d))) > f(b, f(a, f(c, d))) (termination); and that rewrites to f(x, f(y, f(z, w))) similarly. The right term f(x, f(z, f(y, w))) also rewrites to f(x, f(y, f(z, w))). Thus the critical pair is redundant.

In general we try two choices of σ: one where the first variable in s = u is mapped to a<sub>1</sub>, the second to a<sub>2</sub>, and so on (with a<sub>1</sub> < ... < a<sub>n</sub>); and another where the variables are mapped in reverse order. The critical pair is redundant if either choice of σ works. This is not a principled choice (most likely, some critical pairs need a different σ), but we do not know how to find the "best" σ.
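The two candidate substitutions can be sketched like so; the term type, the a1, a2, ... constant names, and the left-to-right variable collection order are our own illustrative assumptions:

```haskell
import Data.List (nub)

-- Illustrative term type, not Twee's actual representation.
data Term = Var String | App String [Term]
  deriving (Eq, Show)

-- Distinct variables in left-to-right order of first occurrence.
varsInOrder :: Term -> [String]
varsInOrder = nub . go
  where
    go (Var x)    = [x]
    go (App _ ts) = concatMap go ts

-- The two grounding substitutions tried: map the i-th variable to
-- constant a_i, once forwards and once with the variables reversed.
candidateSubsts :: Term -> [[(String, Term)]]
candidateSubsts t = [zip xs consts, zip (reverse xs) consts]
  where
    xs     = varsInOrder t
    consts = [App ("a" ++ show i) [] | i <- [1 :: Int ..]]
```

With the constants ordered a<sub>1</sub> < a<sub>2</sub> < ..., the second substitution grounds the variables in the opposite order to the first, as described above.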

#### **4 Implementation**

Twee consists of 5300 lines of Haskell code, comprising: terms, unification etc. (1150 lines); the frontend (850 lines); proof output (700 lines); general data structures (700 lines); the main proof loop (600 lines); joining, ground joining and connectedness (500 lines); critical pairs and the passive set (400 lines); term indexing (250 lines); and KBO (150 lines). This does not include TPTP parsing, clausification, etc., which are provided by the 4000-line Jukebox [16] program.

Most of Twee is written in a high-level, Haskell-idiomatic, somewhat inefficient style. Performance-critical parts (term manipulation, term indexing, and the passive set) are coded more carefully, and are described below. The bottleneck is usually normalising the many millions of critical pairs that are generated.

#### **4.1 Terms**

The simplest way to represent terms in Haskell, as trees, is not ideal: it creates pressure on the garbage collector, and core operations such as matching and unification become heavily recursive and needlessly slow.

Instead, we represent terms as *flatterms*: the term is flattened into a list of symbols and stored in an array. In order to preserve the structure of the term, each symbol is paired with a number giving the size of the subterm rooted at that symbol. For example, the term f(x, g(x, y)) is represented as:

```
f:5 | x:1 | g:3 | x:1 | y:1
```

where e.g. g : 3 indicates a subterm with root g that is 3 symbols long (g, x, y).

In addition, each function and variable has an *ID number*, and the term stores those ID numbers, rather than a pointer to the function or variable. So, in the array above, the "f" really means the ID number of f. Functions have positive ID numbers, and variables negative, so they can be easily told apart, and there is a separate global array which maps ID numbers to functions. This design allows us to represent a term as a simple array of integers, so that pressure on the garbage collector is reduced. Also, comparing two terms for equality just amounts to a bytewise comparison of the arrays (a C memcmp). What's more, by using array slicing, we can view a term's subterms as flatterms in their own right.
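A toy version of the flatterm idea can be written as follows. This is a simplification under our own assumptions: real Twee packs ID numbers and sizes into unboxed integer arrays and slices without copying, whereas this sketch uses boxed arrays of `(String, Int)` pairs and copies on slicing:

```haskell
import Data.Array

-- Tree terms, used only as input to flattening.
data Term = Var String | App String [Term]
  deriving (Eq, Show)

-- A flatterm: the term in prefix order, each symbol paired with the
-- size of the subterm rooted at it.
newtype Flat = Flat (Array Int (String, Int))
  deriving (Eq, Show)

flatten :: Term -> Flat
flatten t = Flat (listArray (0, length syms - 1) syms)
  where
    syms = go t
    go (Var x)    = [(x, 1)]
    go (App f ts) = let subs = concatMap go ts
                    in (f, 1 + length subs) : subs

-- The subterm-size annotations let a slice of the array be viewed as
-- a flatterm in its own right.
subtermAt :: Flat -> Int -> Flat
subtermAt (Flat a) i =
  let (_, n) = a ! i
  in Flat (listArray (0, n - 1) [a ! j | j <- [i .. i + n - 1]])
```

For f(x, g(x, y)) this produces the five-slot array pictured above, and `subtermAt` at index 2 recovers the flatterm of g(x, y).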

On top of this we build a higher-level API. There are two types, terms and termlists, both implemented as flatterms. With the help of Haskell's user-defined patterns, they are exposed to the user as ordinary algebraic datatypes. We can use normal pattern matching to e.g. check if a term is a function or variable, access its children (as a termlist), iterate through it a symbol or subterm at a time, etc. All these operations turn into a few machine instructions. Matching and unification are implemented using this API as efficient tail-recursive loops.

#### **4.2 Indexing**

Rewriting uses a perfect discrimination tree [15], including Waldmeister's refinements [12]. The implementation takes care not to create backtracking points unless needed. There is no unification index, since this is not usually a bottleneck.

#### **4.3 The Passive Set**

Early versions of Twee often ran out of memory after about 30 minutes. The reason is the passive set—it grows quadratically in the number of active rules, because any pair of rules can have a critical pair. In typical prover runs it contains anywhere between a million and a hundred million critical pairs.

Twee now uses a space-efficient passive set representation adapted from Waldmeister [12]. The main idea is to throw away all terms involved in the critical pair, and only remember: (1) the ID numbers of the two rules involved, (2) the position of the overlap, and (3) the score of the critical pair. When a critical pair is selected, the ID numbers and position are used to reconstruct the critical pair. This design uses about 12 bytes of memory per critical pair, so Twee can run for many hours without running out of memory.
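A compact entry in this style might look as follows; the field widths here are our own illustration of "about 12 bytes" (4 + 4 + 2 + 2), not Twee's actual layout:

```haskell
import Data.Word

-- Sketch of a compact passive-set entry: no terms are stored, only
-- the IDs of the two active rules, the overlap position, and the
-- heuristic score (field widths are illustrative).
data Passive = Passive
  { rule1   :: !Word32  -- ID of the first active rule
  , rule2   :: !Word32  -- ID of the second active rule
  , overlap :: !Word16  -- position of the overlap
  , score   :: !Word16  -- score of the critical pair
  } deriving (Eq, Show)

-- When an entry is selected, the critical pair is reconstructed from
-- the active rules; here we only sketch the lookup by rule ID, which
-- fails if either rule has since been deleted.
reconstruct :: (Word32 -> Maybe rule) -> Passive -> Maybe (rule, rule, Word16)
reconstruct lookupRule p =
  (,,) <$> lookupRule (rule1 p)
       <*> lookupRule (rule2 p)
       <*> pure (overlap p)
```

The trade-off is recomputation on selection: the overlap must be redone for the few critical pairs actually chosen, in exchange for a fixed small footprint for the millions that are merely stored.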

#### **5 Evaluation**

In this section we report on two evaluations: one investigating the effect of the different redundancy criteria of Section 3, and one comparing the performance of Twee against E 2.5 and Waldmeister. In both cases we ran Twee on all 981 unsatisfiable UEQ problems from TPTP 7.4.0, with a time limit of 5 minutes.

*Redundancy criteria.* Figure 1a shows how the performance of Twee varies depending on which redundancy criteria are enabled. The x-axis shows the number of problems solved (starting from problem 600) and the y-axis shows the runtime for that problem. The combination of ground joinability testing and connectedness is much stronger than either on their own—it seems that each catches cases that the other misses. It is clearly best to have both switched on.

The figure also includes a variant of Twee which implements the heuristic for AC functions described in [1] (and no other redundancy criterion), which solves fewer problems than our approach. This is perhaps not surprising, as our approach handles a wider class of functions.

*Twee, E, Waldmeister.* Figure 1b compares Twee's performance against E and Waldmeister. Twee is run in three variations: with and without the goal-directed transformation from Section 2, and as a timesliced version which runs the other two versions for 150s each. By far the best choice for Twee is to timeslice, which brings it close to Waldmeister's performance. This suggests that Twee with and without the goal transformation solve somewhat different sets of problems.

Fig. 1: Benchmarks. (a) Different redundancy criteria. (b) Comparison against Waldmeister and E.

#### **6 Future Work**

Knuth-Bendix completion pays little attention to the goal: it simply completes the rewrite system until the goal becomes trivial. We plan to search for ways to make Twee more goal-directed, for example by rewriting the goal backwards somewhat in the style of [18]. The success of the goal transformation shows that goal direction ought to be important.

Twee uses a fixed term ordering, which is clearly a weakness on certain problem kinds such as RNG. We do not want to choose a term order based on syntactic analysis of the problem, but would like to choose it dynamically based on the state of the proof, perhaps by incorporating ideas from MædMax [19].

#### **7 Conclusion**

Twee is a unit equality prover implemented in 5300 lines of Haskell code. Its performance is good, thanks to a careful implementation, strong redundancy criteria and a transformation that helps goal-directedness. It performs particularly strongly on problems involving permutative laws, such as those in LAT and REL. Its main weaknesses are that it always uses a fixed term order, and has only weak goal direction. We hope that a future version of Twee, with real goal direction and a smart choice of term order, will be even stronger.

*Acknowledgements.* This work was supported by the Swedish Research Council (VR) grant 2016-06204, *Systematic Testing of Cyber-Physical Systems (SyTeC)*.

We thank the reviewers for their many helpful comments.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# The Isabelle/Naproche Natural Language Proof Assistant

Adrian De Lon<sup>1</sup> , Peter Koepke<sup>1</sup> , Anton Lorenzen<sup>1</sup> , Adrian Marti<sup>1</sup> , Marcel Schütz<sup>1</sup> , and Makarius Wenzel<sup>2</sup>

<sup>1</sup> University of Bonn, Bonn, Germany, https://www.math.uni-bonn.de/ag/logik <sup>2</sup> Augsburg, Germany, https://sketis.net

**Abstract.** Naproche is an emerging *natural proof assistant* that accepts input in the controlled natural language ForTheL. Naproche is included in the current version of Isabelle/PIDE, which allows comfortable editing and asynchronous proof-checking of ForTheL texts. The .tex dialect of ForTheL can be typeset by LaTeX into documents that approximate the language and appearance of ordinary mathematical texts.

## 1 Introduction

Naproche (for Natural Proof Checking) is an emerging *natural* proof assistant that accepts input in a controlled natural language, approximating ordinary mathematical language and texts. The system uses


The current version of Naproche also introduces a LaTeX dialect of ForTheL so that high-quality mathematical typesetting is readily available. Naproche allows the formalization and proof-checking of advanced mathematics in a style that is immediately readable by mathematicians. Example formalizations from various domains of undergraduate mathematics are included.

Naproche ships as a component in the latest release of the Isabelle prover platform [8]. When editing a ForTheL file in the Isabelle/jEdit Prover IDE (PIDE), an auxiliary Naproche server runs in the background to quickly answer requests for checking ForTheL texts, with an internal cache to avoid repeated checking of unchanged text segments. The implementation uses programming interfaces of Isabelle/PIDE that allow user-defined file formats to participate in the concurrent document model. A second auxiliary server allows the Naproche program to run external prover processes under the control of Isabelle, with explicit timeouts. This works reliably on the usual platforms (Linux, Windows, macOS) by re-using external provers of Isabelle/Sledgehammer [17]. From the perspective of logic, there is *no connection* between Naproche and Isabelle/Sledgehammer or any other Isabelle/HOL tools.

© The Author(s) 2021. A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 614–624, 2021. https://doi.org/10.1007/978-3-030-79876-5_36

In this paper we briefly discuss the need for *natural* proof assistants, provide some general information on Isabelle/Naproche, and give an overview of the methods employed in the system, using an excerpt from a formalization of Euclid's infinitude of primes as a running example. To conclude, we compare Naproche to other projects in formal mathematics with natural language input and indicate ways to further extend Naproche's naturalness and efficiency.

#### 2 Natural Proof Assistants

While state-of-the-art interactive theorem provers have been successfully used to prove and certify highly non-trivial research mathematics, they are still, according to Lawrence Paulson [16], "unsuitable for mathematics. Their formal proofs are unreadable."

*Natural proof assistants* intend to bridge the wide gap between intuitive mathematical texts and the formal rigour of logical calculi. We propose the following criteria for natural proof assistants:


We expect that naturalness will be crucial for the adoption of formal mathematics by the wider mathematical community. This is in line with some ongoing large-scale projects in formal mathematics. For instance, the *ALEXANDRIA* project by Paulson [16] stipulates:

*ALEXANDRIA will be based on legible structured proofs. Formal proofs should be not mere code, but a machine-checkable form of communication between mathematicians.*

The *Formal Abstracts* project of Thomas Hales [5] intends to


#### 3 Isabelle/Naproche

The Naproche proof assistant stems from two long-term efforts aiming towards naturalness: the Evidence Algorithm (EA) and System for Automated Deduction (SAD) projects at the universities of Kiev and Paris [14,15,20,21], and the Naproche project at Bonn [1,2,3,10]. Naproche extends the input language ForTheL of SAD and embeds it into LaTeX, allowing mathematical typesetting; the original proof-checking mechanisms of SAD have been made more efficient and varied.

The first experimental integration of the then Naproche-SAD prover into the Isabelle Prover IDE was done in 2018 by Frerix and Wenzel [23, §1.2]. The current (refined and extended) version has now become a bundled component of Isabelle2021 [8]. After downloading and unpacking the Isabelle distribution, Isabelle/Naproche becomes immediately accessible in the *Documentation* panel, section *Examples*, entry \$ISABELLE\_NAPROCHE/Intro.thy. Isabelle and its add-on components work directly without manual installation, but this comes at the cost of substantial resource requirements: on Linux the total size is 1.2 GB, which includes Java 15 (330 MB), E prover 2.5 (30 MB), and Naproche (20 MB). The bulk of the other Isabelle components are required for Isabelle/HOL theory and proof development, but Naproche has no logical connection to that.

The Naproche prover is invoked automatically when editing ForTheL files with .ftl or .ftl.tex extensions. Further examples and an introductory tutorial are linked in the Isabelle theory file \$ISABELLE\_NAPROCHE/Intro.thy: as usual for Isabelle/jEdit and other IDEs, following a link works by a mouse click combined with the keyboard modifier CTRL (Linux, Windows) or CMD (macOS). The examples deal with results from undergraduate number theory, geometry, and set theory; most are available in the classic ASCII style as well as in LaTeX style and typeset in PDF.

The ForTheL library FLib [13] contains a variety of formalizations for earlier versions of Naproche. Some substantial texts have been written as undergraduate student projects and cover, e.g., group theory up to the Sylow theorems, initial chapters from Walter Rudin's *Analysis*, or set theory up to Silver's theorem in cardinal arithmetic. These texts will soon be upgraded to the new version of Naproche and included in an interlinked formalized library of readable and proof-checked mathematical texts.

#### 4 Formalizing in ForTheL

#### 4.1 Example

The following screenshot shows a proof of the infinitude of prime numbers in the Isabelle/Naproche Prover IDE, taken from the bundled tutorial, which itself is a proof-checked ForTheL text:

The editor buffer contains the ForTheL source, which also happens to conform to standard LaTeX format. (The "Contradiction" lemma, now deactivated by a %, is a left-over of a typical check for hidden inconsistencies in the axiomatic setup.) The Output panel contains feedback from the Naproche prover about the source document: "verification successful" and some statistics; the most relevant messages are also shown in-line over the source as squiggly underlines with popups on mouse-hovering. The SideKick/LaTeX structure overview is provided by standard plugins of the underlying text editor. This piece of mathematics is typeset by LaTeX as follows:

#### Euclid's Theorem

Signature. P is the class of prime natural numbers.
Theorem. P is infinite.
*Proof.* Assume that r is a natural number and p is a sequence of length r and {p<sub>1</sub>,...,p<sub>r</sub>} is a subclass of P. [...] □

#### 4.2 The ForTheL Language

The mathematical controlled language ForTheL has been developed over several decades in the Evidence Algorithm (EA) / System for Automated Deduction (SAD) project. It is carefully designed to approximate the weakly typed natural language of mathematics whilst being efficiently translatable to first-order logic. In ForTheL, standard mathematical types are called *notions*; internally, a notion is represented as a predicate with a distinguished variable, treated as a unary predicate whose other variables act as parameters ("types as predicates"). This leads to a flexible dependent type system where number systems can be cumulative (ℕ ⊆ ℝ), and notions can depend on parameters (subsets of ℕ, divisors of n).

First-order languages of notions, constants, relations, and functions can be introduced and extended by *signature* and *definition* commands. The formalization of Euclid's theorem, e.g., begins as follows:

Signature. A natural number is a small object.
Let m, n denote natural numbers.
Signature. 0 is a natural number.
⋮
Signature. m + n is a natural number.

#### 5 Architecture of the aproche System

Naproche follows standard principles of interactive theorem proving, but with a strong emphasis on the naturalness aspects explained above. The general information processing in the system is described in the following diagram. The core Naproche program is implemented in Haskell.

In the sequel we describe the main components of Naproche.

#### 5.1 Tokenizing and Parsing

Naproche uses a standard tokenizing algorithm for cutting text up into a list of meaningful tokens, with precise source positions to enable PIDE messages and markup, e.g., by colours for free and bound variables. When using LaTeX syntax, the tokenizer also takes care of expanding certain TeX commands (see the next subsection).

Parsing is carried out in Haskell's monadic style with parser combinators. We allow ambiguous parsing, since it better fits natural language. Currently the translation into tagged first-order logic is already part of the parsing process. The following translation of our example snippet was obtained by running Naproche from the command line with the -T (translate) option:

```
......
hypothesis.
  assume forall v0 ((HeadTerm :: v0 = Primes) implies
  (aClass(v0) and forall v1 (aElementOf(v1,v0)
  iff (aNaturalNumber(v1) and isPrime(v1))))).
conjecture Euclid.
  isInfinite(Primes).
  proof.
    assume ((aNaturalNumber(r) and aSequenceOfLength(p,r)) and
    aSubsetOf(Set{p}{1}{r},Primes)).
    n = Prod{p}{1}{r}+1.
......
```
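The monadic parser-combinator style mentioned above can be illustrated with a toy example; nothing here is Naproche's actual grammar or code, and the `aNaturalNumber` output merely mimics the translation format shown. A parser returning a *list* of results is one simple way to accommodate ambiguity:

```haskell
import Control.Applicative
import Data.Char (isAlpha)

-- A list-of-successes parser: all parses, each with leftover input.
newtype P a = P { runP :: String -> [(a, String)] }

instance Functor P where
  fmap f (P p) = P (\s -> [(f a, r) | (a, r) <- p s])
instance Applicative P where
  pure a = P (\s -> [(a, s)])
  P pf <*> P pa = P (\s -> [(f a, r') | (f, r) <- pf s, (a, r') <- pa r])
instance Monad P where
  P p >>= f = P (\s -> concat [runP (f a) r | (a, r) <- p s])
instance Alternative P where
  empty = P (const [])
  P p <|> P q = P (\s -> p s ++ q s)

-- Accept exactly the given word (skipping leading spaces).
word :: String -> P String
word w = P (\s -> let (tok, rest) = span isAlpha (dropWhile (== ' ') s)
                  in [(w, rest) | tok == w])

-- Accept any alphabetic name.
name :: P String
name = P (\s -> case span isAlpha (dropWhile (== ' ') s) of
                  ([], _)     -> []
                  (tok, rest) -> [(tok, rest)])

-- "n is a natural number"  ~>  "aNaturalNumber(n)"
statement :: P String
statement =
  (\x _ _ _ _ -> "aNaturalNumber(" ++ x ++ ")")
    <$> name <*> word "is" <*> word "a" <*> word "natural" <*> word "number"
```

Because every parser returns all of its parses, grammar rules combined with `<|>` can overlap freely and ambiguity simply shows up as multiple results.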
In order to make Naproche more versatile we plan to parse into an abstract syntax tree instead, so that different logical back-ends could translate into different logics. We have already experimented with translating ForTheL to Lean [12].

Moreover, with the input language growing, we shall eventually turn to a grammatical framework to speed up language development without hard-coding vocabulary or grammar rules into the Naproche code.

#### 5.2 LaTeX Processing

We have extended Naproche to support a .ftl.tex format, in addition to the original .ftl format. Files in .ftl.tex format are intended to be readable both by Naproche for logical checking and by LaTeX for typesetting.

The LaTeX tokenizer ignores the whole document, except what is inside forthel environments of the form

```
\begin{forthel}
```

```
% Insert what you want Naproche to process here
\end{forthel}
```
In a forthel environment, standard LaTeX syntax can be used for declaring text environments for theorems and definitions.

In Naproche, users can define their own operators and phrases by defining linguistic and symbolic *patterns*. This mechanism has been adapted to allow LaTeX constructs in patterns. In the Euclid text we use the pattern \Set{p}{1}{r} for the finite set {p<sub>1</sub>,...,p<sub>r</sub>}. By defining \Set as a LaTeX macro we can arrange that the ForTheL pattern will be printed in the familiar set notation:

```
\newcommand{\Set}[3]{\{#1_{#2},\dots,#1_{#3}\}}
```
There are some primitive concepts in Naproche, such as the logical operators ∨, ∧, ∃, that are directly recognized in the LaTeX source and expanded to corresponding internal tokens.

The current release of Naproche does not differentiate between math mode and text mode in LaTeX, since it re-uses much of the parsing machinery of the original .ftl format. Future releases shall make such a distinction to increase the robustness of the parser, improve error messages and resolve some ambiguities in the current grammar.

#### 5.3 Logical Processing

The first-order formulas derived from ForTheL statements are put into an internal ProofText data type consisting of blocks of formulae, arranged in a tree-like fashion. The tree structure mirrors the logical structure of a text, where a statement can be seen as a node to which a subtext, e.g. its proof, is attached. Since statements in a proof can have their own subproofs, this leads to a recursive tree structure, on which the further checking is performed along a depth-first left-to-right traversal.

#### 5.4 Ontological Checking by the Naproche Reasoner

An innocent mathematical statement like a<sup>2</sup> + b<sup>2</sup> = c<sup>2</sup> contains a number of implicit proof tasks, even if the whole statement is not to be proved, but is part of a definition or an assumption. One has to check that a, b, c are (numerical) terms to which the squaring operation can be applied, and that the resulting squares can be subjected to addition and equality. These checks are called "ontological", and they roughly correspond to type checking in type-oriented systems. The situation here is however more complicated, as types (i.e. notions) and operations may involve first-order definitions with preconditions, which cannot be decided during the parsing process but only during proof-checking. So in the checking process each node of the aforementioned tree is first checked *ontologically*; if the node formula itself is marked as a conjecture, it is then *logically* checked.

#### 5.5 Logical Checking by the Naproche Reasoner

The various checks are organized by the Naproche reasoner module. In simple cases the reasoner itself can supply a proof; if not, the reasoner constructs proof tasks for the ATP. Since definitions in first-order logic are formally symmetric equivalences, they may lead to circularities in proof searches. Instead, definitions are successively unfolded by replacing the definiendum with the definiens. This process may be iterated when proof attempts fail.

The ATP is given certain timeouts to search for proofs. Ontological checking is supposed to be easier than proper mathematical proving, so the default time for each ontological check is set to 1 second, whereas proving gets 3 seconds and can be iterated over several rounds of definition unfolding.

#### 5.6 Communication with an External ATP

Proof tasks are translated into the generic TPTP first-order format for ATPs. These can be viewed in the Output window of Isabelle/jEdit, after inserting the directive [dump on] into the ForTheL source. The final proof task in checking Euclid's proof ends with the TPTP lines:

```
fof(m_,hypothesis,( ! [W0] : (aClass(W0) =>
    (isInfinite(W0) <=> ( ~ isFinite(W0)))))).
fof(m_,hypothesis,(aClass(szPzrzizmzezs) &
    ( ! [W0] : (aElementOf(W0,szPzrzizmzezs)
    <=> (aNaturalNumber(W0) & isPrime(W0)))))).
fof(m__,conjecture,
    ......
    (aElementOf(W4,szSzeztlcdtrclcz1rclcdtrc(W0,W1)) <=>
    (aNaturalNumber(W4) & isPrime(W4))))))))))))) =>
    isInfinite(szPzrzizmzezs))).
```
By default, Naproche uses the E prover [19] as external ATP, but one may switch to other provers available in the Isabelle distribution.

## 6 Integration into Isabelle

The initial integration of Naproche into the Isabelle Prover IDE happened in 2018 and is briefly reported as an example in the PIDE overview article [23] based on Isabelle2019 (June 2019). The main idea was to turn the existing Haskell command-line program into a TCP server that can answer concurrent requests for checking ForTheL texts in a purely functional manner, with proper handling of cancel messages (for interrupts caused by user editing); this required removing a few low-level system operations, such as reading physical files or exiting the process. Afterwards, the semantic operation forthel\_file in Isabelle – to check ForTheL text and produce markup messages according to the PIDE protocol – was implemented as an Isabelle/Isar command in Isabelle/ML as usual, but the main work is delegated to the Naproche server. Its implementation uses the Isabelle/Haskell library for common Isabelle/PIDE message formats, source positions, markup etc. – it is maintained within the Isabelle distribution.

The current version of Isabelle/Naproche refines this approach in various respects. In particular, Isabelle2021 now provides a standard mechanism for user-defined *Isabelle/Scala services*: this is relevant both for the Isabelle command-line tools that build and test Isabelle/Naproche, and for the Prover IDE support of ForTheL files, which connects the Isabelle/jEdit front-end to the Naproche back-end.

Moreover, the Java process running the Prover IDE provides an additional TCP server to launch external provers that are already distributed with Isabelle (thanks to Isabelle/Sledgehammer): Naproche applications mainly use the current E prover 2.5 [19], but SPASS and Vampire are available for experiments.

The existing management of processes in Isabelle/Scala involves considerable efforts to robustly support interrupts and timeouts in a concurrent environment; this works on all platforms supported by Isabelle (using special tricks for Windows/Cygwin, and macOS/Rosetta on Apple Silicon).

The documentation file \$ISABELLE\_NAPROCHE/Intro.thy gives further hints on implementation near the end, with hyperlinks to the sources. A lot of technical Isabelle infrastructure is re-used by Isabelle/Naproche, but there is presently no connection to Isabelle/HOL, which is a much larger and better-known application of the same Isabelle framework [18].

# 7 Related and Future Work

Bridging the gap between mathematical practice and fully formal methods has always been a central concern in formal mathematics. The development of the Mizar system [11] was accompanied or even driven by the stepwise adaptation of its language to standard mathematical proof methods and logical foundations. In contrast, most interactive theorem provers feature formal tactic languages, with tactic scripts that can hardly be understood without stepwise tracing and reconstructing internal logical states.

The Mizar language has been a role model for other proof languages. There are, e.g., "Mizar modes" for HOL [6,25] and Coq [4] and the widely used Isar language for Isabelle [24,22]. These languages can be read by mathematicians, with some effort, but they retain a strong bias toward computer science customs. A survey of input languages for formalization on a scale between formal and natural can be found in [9].

Only a few formal mathematics projects have aimed at processing actual mathematical language. These projects have operated in isolation and seem to be mostly inactive now. The paper [7] by Muhammad Humayoun and Christophe Raffalli, e.g., describes the MathNat project and also surveys other related attempts.

The Naproche approach can be viewed in the Mizar tradition: use a rich controlled language for mathematics, increase the proving capabilities by strong automated theorem proving, and, eventually, create an extensive library of basic mathematics and specialized theories, which can simultaneously be used as a library for human readers.

The readability and naturalness of texts which proof-check in the Naproche system motivate significant further extensions of the project, where ad hoc methods are to be replaced by principled and established approaches:

1. the input language ForTheL has to be extended for wide mathematical coverage; ForTheL needs an extensive formal grammar and vocabulary to be processed by strong linguistic methods; the vocabulary may also encompass standard LaTeX symbols and semantic information;

2. methods of type derivation and elaboration should be provided;

3. Isabelle/Sledgehammer-like methods should lead to efficient premise selection in large texts and theories;

4. the creation of libraries of ForTheL documents requires import and export mechanisms corresponding to quoting and referencing in the mathematical literature;

5. the natural text processing of Naproche should be interfaced with other proof assistants to leverage their strengths and libraries. We shall in particular work on a "Naproche mode" for Isabelle.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **The Lean 4 Theorem Prover and Programming Language**

Leonardo de Moura1(B) and Sebastian Ullrich<sup>2</sup>

<sup>1</sup> Microsoft Research, Redmond WA, USA leonardo@microsoft.com

<sup>2</sup> Karlsruhe Institute of Technology, Karlsruhe, Germany sebastian.ullrich@kit.edu

Abstract. Lean 4 is a reimplementation of the Lean interactive theorem prover (ITP) in Lean itself. It addresses many shortcomings of the previous versions and contains many new features. Lean 4 is fully extensible: users can modify and extend the parser, elaborator, tactics, decision procedures, pretty printer, and code generator. The new system has a hygienic macro system custom-built for ITPs. It contains a new typeclass resolution procedure based on tabled resolution, addressing significant performance problems reported by the growing user base. Lean 4 is also an efficient functional programming language based on a novel programming paradigm called *functional but in-place*. Efficient code generation is crucial for Lean users because many write custom proof automation procedures in Lean itself.

# 1 Introduction

The Lean project<sup>3</sup> started in 2013 [9] as an interactive theorem prover based on the Calculus of Inductive Constructions [4] (CIC). In 2017, using Lean 3, a community of users with very different backgrounds started the Lean mathematical library project mathlib [13]. At the time of this writing, mathlib has roughly half a million lines of code, and contains many nontrivial mathematical objects such as Schemes [2]. Mathlib is also the foundation for the Perfectoid Spaces in Lean project [1], and the Liquid Tensor challenge [11] posed by the renowned mathematician Peter Scholze. Mathlib contains not only mathematical objects but also Lean metaprograms that extend the system [5]. Some of these metaprograms implement nontrivial proof automation, such as a ring theory solver and a decision procedure for Presburger arithmetic. Lean metaprograms in mathlib also extend the system by adding new top-level commands and features not related to proof automation. For example, it contains a package of semantic linters that alert users to many commonly made mistakes [5]. Lean 3 metaprograms have

<sup>3</sup> http://leanprover.github.io

<sup>©</sup> The Author(s) 2021

A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 625–635, 2021. https://doi.org/10.1007/978-3-030-79876-5\_37

been also instrumental in building standalone applications, such as a SQL query equivalence checker [3].

We believe the Lean 3 theorem prover's success is primarily due to its extensibility capabilities and metaprogramming framework [6]. However, users cannot modify many parts of the system without changing Lean 3 source code written in C++. Another issue is that many proof automation metaprograms are not competitive with similar proof automation implemented in programming languages with an efficient compiler such as C++ and OCaml. The primary source of inefficiency in Lean 3 metaprograms is the virtual machine interpretation overhead.

Lean 4 is a reimplementation of the Lean theorem prover in Lean itself<sup>4</sup>. It is an extensible theorem prover and an efficient programming language. The new compiler produces C code, and users can now implement efficient proof automation in Lean, compile it into efficient C code, and load it as a plugin. In Lean 4, users can access all internal data structures used to implement Lean by merely importing the Lean package. Lean 4 is also a platform for developing efficient domain-specific automation. It has a more robust and extensible elaborator, and addresses many other shortcomings of Lean 3. We expect the Lean community to extend and add new features without having to change the Lean source code. We released Lean 4 at the beginning of 2021; it is open source, the community is already porting mathlib, and the number of applications is growing quickly. These applications include a translation verifier for Reopt<sup>5</sup>, a package for supporting inductive-inductive types<sup>6</sup>, and a car controller<sup>7</sup>.

# 2 Lean by Example

In this section, we introduce the Lean language using a series of examples. The source code for the examples is available at https://github.com/leanprover/lean4/blob/cade2021/doc/BoolExpr.lean. For additional details and installation instructions, we recommend the reader consult the online manual<sup>8</sup>.

We define functions by using the def keyword followed by the function's name, a parameter list, return type, and body. The parameter list consists of successive parameters that are separated by spaces. We can specify an explicit type for each parameter. If we do not specify the return type, the elaborator tries to infer it from the function body. The Boolean or function is defined by pattern-matching as follows

```
def or (a b : Bool) :=
  match a with
  | true => true
  | false => b
```
<sup>4</sup> http://github.com/leanprover/lean4

<sup>5</sup> https://github.com/GaloisInc/reopt-vcg

<sup>6</sup> https://github.com/javra/iit

<sup>7</sup> https://github.com/GaloisInc/lean4-balance-car

<sup>8</sup> http://leanprover.github.io/lean4/doc

We can use the command #check <term> to inspect the type of term, and #eval <term> to evaluate it.

```
#check or true false -- Bool (this is a comment in Lean)
#eval or true false -- true
```

Lean has a hygienic macro system and comes equipped with many macros for commonly used idioms. For example, we can also define the function or using

```
def or : Bool → Bool → Bool
 | true, _ => true
 | false, b => b
```
The notation above is a macro that expands into a match-expression. In Lean, a theorem is a definition whose result type is a proposition. As an example, consider the following simple theorem about the definition above

```
theorem or_true (b : Bool) : or true b = true :=
  rfl
```
The constant rfl has type ∀ {α : Sort u} {a : α}, a = a; the curly braces indicate that the parameters α and a are implicit and should be inferred by solving typing constraints. In the example above, the inferred values for α and a are Bool and or true b, respectively, and the resulting type is or true b = or true b. This is a valid proof because or true b is definitionally equal to true. In dependent type theory, every term has a computational behavior and supports a notion of reduction. Two terms that reduce to the same value are called definitionally equal. In the following example, we use pattern matching to prove that or b b = b

```
theorem or_self : ∀ (b : Bool), or b b = b
  | true => rfl
  | false => rfl
```
Note that or b b does not reduce to b, but after pattern matching we have that or true true (or false false) reduces to true (false).

In the following example, we define the recursive datatype BoolExpr for representing Boolean expressions using the command inductive.

```
inductive BoolExpr where
  | var (name : String)
  | val (b : Bool)
  | or (p q : BoolExpr)
  | not (p : BoolExpr)
```
This command generates constructors BoolExpr.var, BoolExpr.val, BoolExpr.or, and BoolExpr.not. The Lean kernel also generates an inductive principle for the new type BoolExpr. We can write a basic "simplifier" for Boolean expressions as follows

```
def simplify : BoolExpr → BoolExpr
  | BoolExpr.or p q => mkOr (simplify p) (simplify q)
  | BoolExpr.not p => mkNot (simplify p)
  | e => e -- assumed catch-all alternative: variables and values are left unchanged
where
 mkOr : BoolExpr → BoolExpr → BoolExpr
   | p, BoolExpr.val true => BoolExpr.val true
   | p, BoolExpr.val false => p
   | BoolExpr.val true, p => BoolExpr.val true
   | BoolExpr.val false, p => p
   | p, q => BoolExpr.or p q
 mkNot : BoolExpr → BoolExpr
   | BoolExpr.val b => BoolExpr.val (not b)
   | p => BoolExpr.not p
```
The function simplify is a simple bottom-up simplifier. We use the where clause to define two local auxiliary functions mkOr and mkNot for constructing "simplified" or and not expressions respectively. Their global names are simplify.mkOr and simplify.mkNot.

Given a context that maps variable names to Boolean values, we define a "denotation" function (or evaluator) for Boolean expressions. We use an association list to represent the context.

```
abbrev Context := AssocList String Bool
def denote (ctx : Context) : BoolExpr → Bool
  | BoolExpr.or p q => denote ctx p || denote ctx q
  | BoolExpr.not p => !denote ctx p
  | BoolExpr.val b => b
  | BoolExpr.var x => if let some b := ctx.find? x then b else false
```
In the example above, p || q is notation for or p q, !p for not p, and if let p := t then a else b is a macro that expands into match t with | p => a | \_ => b. The term ctx.find? x is syntax sugar for AssocList.find? ctx x.
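As a small aside, the if let sugar can be spelled out directly. The following sketch (the function names are our own, for illustration) shows two definitions that the macro expansion makes equivalent:

```
def headOrZero (xs : List Nat) : Nat :=
  if let x :: _ := xs then x else 0

-- the expansion described above, written by hand
def headOrZero' (xs : List Nat) : Nat :=
  match xs with
  | x :: _ => x
  | _ => 0

#eval headOrZero [7, 8] -- 7
```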

As in previous versions, we can use tactics for constructing proofs and terms. We use the keyword by to switch into tactic mode. Tactics are user-defined or built-in procedures that construct various terms. They are all implemented in Lean itself. The simp tactic implements an extensible simplifier, and is one of the most popular tactics in mathlib. Its implementation<sup>9</sup> can be extended and modified by Lean users.

```
...
@[simp] theorem denote_mkOr (ctx : Context) (p q : BoolExpr)
        : denote ctx (simplify.mkOr p q) = denote ctx (or p q) :=
  ...
def denote_simplify (ctx : Context) (p : BoolExpr)
    : denote ctx (simplify p) = denote ctx p :=
  by induction p with
  | or p q ih1 ih2 => simp [ih1, ih2]
```
<sup>9</sup> https://github.com/leanprover/lean4/blob/cade21/src/Lean/Meta/Tactic/Simp/Main.lean.


In the example above, we use the induction tactic; its syntax is similar to a match-expression. The variables ih1 and ih2 are the induction hypotheses for p and q in the first alternative, for the case where p is a BoolExpr.or. The simp tactic uses any theorem marked with the @[simp] attribute as a rewriting rule (e.g., denote\_mkOr). We explicitly provide the induction hypotheses as additional rewriting rules inside square brackets.

Typeclass Resolution. Typeclasses [16] provide an elegant and effective way of managing ad-hoc polymorphism in both programming languages and interactive proof assistants. We can declare particular elements of a typeclass to be instances. These provide hints to the elaborator: any time the elaborator is looking for an element of a typeclass, it can consult a table of declared instances to find a suitable element. What makes typeclass inference powerful is that one can chain instances, that is, an instance declaration can in turn depend on other instances. This causes class inference to recurse through instances, backtracking when necessary. The Lean typeclass resolution procedure can be viewed as a simple λ-Prolog interpreter [8], where the Horn clauses are the user-declared instances.

For example, the standard library defines a typeclass Inhabited to enable typeclass inference to infer a "default" or "arbitrary" element of types that contain at least one element.

```
class Inhabited (α : Sort u) where
  default : α
def arbitrary [Inhabited α] : α :=
  Inhabited.default
```
The annotation [Inhabited α] at arbitrary indicates that this implicit parameter should be synthesized from instance declarations using typeclass resolution. We can define an instance for our BoolExpr type defined earlier as follows

```
instance : Inhabited BoolExpr where
  default := BoolExpr.val false
```
This instance specifies that the "default" element for BoolExpr is BoolExpr.val false. The following declaration shows that if two types α and β are inhabited, then so is their product:

```
instance [Inhabited α] [Inhabited β] : Inhabited (α × β) where
  default := (arbitrary, arbitrary)
```
The standard library has many built-in classes such as Repr α and DecidableEq α. The class Repr α is similar to Haskell's Show α typeclass, and DecidableEq α is a typeclass for types that have decidable equality. Lean 4 also provides code synthesizers for many built-in classes. The command deriving instructs Lean to auto-generate an instance.

```
deriving instance DecidableEq for BoolExpr

#eval decide (BoolExpr.val true = BoolExpr.val false) -- false
```
In the example above, the deriving command generates the instance

(a b : BoolExpr) → Decidable (a = b)

The function decide evaluates decidable propositions. Thus, the last command returns false since BoolExpr.val true is not equal to BoolExpr.val false.

The increasingly sophisticated uses of typeclasses in mathlib have exposed a few limitations in Lean 3: unnecessary overhead due to the lack of term indexing techniques, and exponential running times in the presence of diamonds. Lean 4 implements a new procedure [12], tabled typeclass resolution, that solves these problems by using discrimination trees<sup>10</sup> for better indexing, and tabling<sup>11</sup>, a generalization of memoization originally introduced to address similar limitations of early logic programming systems.
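To illustrate the diamond problem, the following sketch (the class names are hypothetical, chosen to avoid clashing with the standard library) reaches the same ancestor class along two inheritance paths; a naive resolution procedure re-derives the shared ancestor once per path, while tabling caches the solved subgoal.

```
class Mul' (α : Type) where
  mul : α → α → α

-- two intermediate classes that both extend Mul'
class Semigroup' (α : Type) extends Mul' α
class MulOne' (α : Type) extends Mul' α

-- Monoid' reaches Mul' along two paths: a "diamond"
class Monoid' (α : Type) extends Semigroup' α, MulOne' α

instance : Monoid' Nat where
  mul := Nat.mul

-- resolving the Mul' Nat instance traverses the diamond
#eval Mul'.mul 3 4 -- 12
```

With deep hierarchies such as mathlib's algebraic classes, the number of such paths grows quickly, which is where tabling pays off.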

The hygienic macro system. In interactive theorem provers (ITPs), Lean included, extensible syntax is not only crucial to lower the cognitive burden of manipulating complex mathematical objects, but plays a critical role in developing reusable abstractions in libraries. Lean 3 supports such extensions in the form of restrictive "syntax sugar" substitutions and other ad hoc mechanisms, which are too rudimentary to support many desirable abstractions. As a result, libraries are littered with unnecessary redundancy. The Lean 3 tactic language is plagued by a seemingly unrelated issue: accidental name capture, which often produces unexpected and counterintuitive behavior. Lean 4 takes ideas from the Scheme family of programming languages and solves these two problems simultaneously by use of a hygienic, i.e. capture-avoiding, macro system custom-built for ITPs [15].

Lean 3's "mixfix" notation system is still supported in Lean 4, but is now based on the much more general macro system; in fact, the Lean 3 notation keyword itself has been reimplemented as a macro, more specifically as a macro-generating macro. By providing such a tower of abstractions for writing syntax sugar, of which we will see more levels below, we want to enable users to work in the simplest model appropriate for their use case while always keeping open the option to switch to a lower, more expressive level.

As an example, we define the infix notation Γ ⊨ p, with precedence 50, for the function denote defined earlier.

infix:50 " ⊨ " => denote

The infix command expands to

notation:50 Γ " ⊨ " p:50 => denote Γ p

<sup>10</sup> https://github.com/leanprover/lean4/blob/cade21/src/Lean/Meta/ DiscrTree.lean.

<sup>11</sup> https://github.com/leanprover/lean4/blob/cade21/src/Lean/Meta/ SynthInstance.lean.

which itself expands to the macro declaration

```
macro:50 Γ:term " ⊨ " p:term:50 : term => `(denote $Γ $p)
```
where the syntactic category (term) of placeholders and of the entire macro is now specified explicitly, implying that macros can also be written for/using other categories such as the top-level command. The right-hand side uses an explicit syntax quasiquotation to construct the syntax tree, with syntax placeholders (antiquotations) prefixed with \$. As suggested by the explicit use of quotations, the right-hand side may now be an arbitrary Lean term computing a syntax object, allowing for procedural macros as well.
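For instance, a small macro can be written directly at this level; the name dup! below is our own illustration, not from the paper.

```
-- `dup!` duplicates a term into a pair; the right-hand side is
-- ordinary Lean code producing a syntax object via quotation
macro "dup!" e:term : term => `(($e, $e))

#eval dup! (1 + 2) -- (3, 3)
```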

macro itself is another command-level macro that, for our notation example, expands to two commands

```
syntax:50 term " ⊨ " term:50 : term
macro_rules
  | `($Γ ⊨ $e) => `(denote $Γ $e)
```
that is, a pair of parser extension and syntax transformer. By separating these two steps at this abstraction level, it becomes possible to define (mutually) recursive macros and to reuse syntax between macros. Using macro\_rules, users can even extend existing macros with new rules. In general, separating parsing and expansion means that we can obtain a well-structured syntax tree pre-expansion, i.e. a concrete syntax tree, and use it to implement source code tooling such as auto-completion, go-to-definition, and refactorings.

We can use the syntax command for defining embedded domain-specific languages. In simple cases, we can reuse existing syntactic categories for this but assign them new semantics, such as in the following notation for constructing BoolExpr objects.

```
syntax "`[BExpr|" term "]" : term
macro_rules
 | `(`[BExpr| true]) => `(BoolExpr.val true)
 | `(`[BExpr| false]) => `(BoolExpr.val false)
 | `(`[BExpr| $x:ident]) => `(BoolExpr.var $(quote x.getId.toString))
 | `(`[BExpr| $p ∨ $q]) => `(BoolExpr.or `[BExpr| $p] `[BExpr| $q])
 | `(`[BExpr| ¬ $p]) => `(BoolExpr.not `[BExpr| $p])
#check `[BExpr| p ∨ true]
```
The macro\_rules command above specifies how to convert a subset of the builtin syntax for terms into constructor applications for BoolExpr. The term \$(quote x.getId.toString) converts the identifier x into a string literal.

As a final example, we modify the notation Γ ⊨ p. In the following version, Γ is not an arbitrary term anymore, but a comma-separated sequence of entries of the form var → value, and the right-hand side is now interpreted as a BoolExpr term by reusing our macro from above.

```
syntax entry := ident " → " term:max
syntax entry,* " ⊨ " term : term
```

```
macro_rules
 | `( $[$xs:ident → $vs:term],* ⊨ $p:term ) =>
   let xs := xs.map fun x => quote x.getId.toString
   `(denote (List.toAssocList [$[( $xs , $vs )],*]) `[BExpr| $p])
#eval a → false, b → true ⊨ b ∨ a -- true
```
We use the antiquotation splice \$[\$xs:ident → \$vs:term],\* to deconstruct the sequence of entries into two arrays xs and vs containing the variable names and values, respectively, adjust the former array, and combine them again in a second splice.

# 3 The Code Generator

The Lean 4 code generator produces efficient C code. It is useful for building both efficient Lean extensions and standalone applications. The code generator performs many transformations, and many of them are based on techniques used in the Haskell compiler GHC [7]. However, in contrast to Haskell, Lean is a strict language. We control code inlining and specialization using the attributes @[inline] and @[specialize]. They are crucial for eliminating the overhead introduced by the towers of abstractions used in our source code. Before emitting C code, we erase proof terms and convert Lean expressions into an intermediate representation (IR). The IR is a collection of Lean data structures,<sup>12</sup> and users can implement support for backends other than C by writing Lean programs that import Lean.Compiler.IR. Lean 4 also comes with an interpreter for the IR, which allows for rapid incremental development and testing right from inside the editor. Whenever the interpreter calls a function for which native, ahead-of-time compiled code is available, it will switch to that instead, which includes all functions from the standard library. Thus the interpretation overhead is negligible as long as e.g. all expensive tactics are precompiled.
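A minimal sketch of the inlining and specialization attributes mentioned above (the functions are our own examples, not from the Lean sources):

```
-- `@[inline]` asks the compiler to inline `double` at call sites
@[inline] def double (x : Nat) : Nat :=
  x + x

-- `@[specialize]` requests a copy of `applyTwice` specialized to each
-- concrete `f`, removing the cost of the higher-order call
@[specialize] def applyTwice (f : α → α) (x : α) : α :=
  f (f x)

#eval applyTwice double 3 -- 12
```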

Functional but in-place. Most functional languages rely on garbage collection for automatic memory management. They usually eschew reference counting in favor of a tracing garbage collector, which has less bookkeeping overhead at runtime. On the other hand, having an exact reference count of each value enables optimizations such as destructive updates [14]. When performing functional updates, objects often die just before creating an object of the same kind. We observe a similar phenomenon when we insert a new element into a purely functional data structure such as a binary tree, when a theorem prover rewrites formulas, when a compiler applies optimizations by transforming abstract syntax trees, or in the function simplify defined earlier. We call it the resurrection hypothesis: many objects die just before creating an object of the same kind. The Lean memory manager uses reference counting and takes advantage of this hypothesis, enabling pure code to perform destructive updates in all scenarios described

<sup>12</sup> https://github.com/leanprover/lean4/blob/cade21/src/Lean/Compiler/IR/Basic.lean

above when objects are not shared. It also allows a novel programming paradigm that we call functional but in-place (FBIP) [10]. Our preliminary experimental results demonstrate that our new compiler produces competitive code that often outperforms the code generated by high-performance compilers such as ocamlopt and GHC [14]. As an example, consider the function map f as, which applies a function f to each element of a list as. In this example, [] denotes the empty list, and a::as the list with head a followed by the tail as.

```
def map : (α → β) → List α → List β
 | f, [] => []
 | f, a::as => f a :: map f as
```
If the list referenced by as is not shared, the code generated by our compiler does not allocate any memory. Moreover, if as is a nonshared list of lists of integers, then map (map inc) as will not allocate any memory either. In contrast to static linearity systems, allocations are avoided even if only a prefix of the list is not shared. FBIP also allows Lean users to use data structures, such as arrays and hashtables, in pure code without any performance penalty when they are not shared. We believe this is an attractive feature because hashtables are frequently used to implement decision procedures and nontrivial proof automation.

## 4 The User Interface

Our system implements the Language Server Protocol (LSP) using the task abstraction provided by its standard library. The Lean 4 LSP server is incremental: it continuously analyzes the source text and provides semantic information to editors implementing LSP. Our LSP server implements most LSP features found in advanced IDEs, such as hyperlinks, syntax highlighting, type information, error reporting, and auto-completion. Many editors implement LSP, but VS Code is the Lean user community's preferred editor. We provide extensions for visualizing intermediate proof states in interactive tactic blocks, and we plan to port the Lean 3 widget library for constructing interactive visualizations of proofs and programs.

# 5 Conclusion

Lean 4 aims to be a fully extensible interactive theorem prover and functional programming language. It has an expressive logical foundation for writing mathematical specifications, proofs, and formally verified programs. Lean 4 provides many unique new features, including a hygienic macro system, an efficient typeclass resolution procedure based on tabled resolution, an efficient code generator, and abstractions for sealing low-level optimizations. The new elaboration procedure is more general and efficient than those implemented in previous versions. Users may also extend and modify the elaborator using Lean itself. Lean has a relatively small trusted kernel, and the rich API allows users to export their developments to other systems and implement their own reference checkers. Lean is an ongoing and long-term effort, and future plans include integration with external SMT solvers and first-order theorem provers, new compiler backends, and porting the Lean 3 Mathematical Library.

Acknowledgments. We are grateful to Marc Huisinga and Wojciech Nawrocki for developing the LSP server, Daniel Selsam for working with us on the new typeclass resolution procedure and interesting design discussions, Daan Leijen, Nikhil Swamy, Sebastian Graf, Simon Peyton Jones, and Max Wagner for advice and design discussions, Joe Hendrix, Andrew Kent, Rob Dockins, and Simon Winwood from Galois Inc for being early Lean 4 adopters, and providing useful feedback and suggestions, and the whole Lean community for all their excitement and pushing Lean forward.

# References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Harpoon: Mechanizing Metatheory Interactively**

Jacob Errington , Junyoung Jang , and Brigitte Pientka

McGill University, Montreal, Canada

{jacob.errington, junyoung.jang}@mail.mcgill.ca, bpientka@cs.mcgill.ca

**Abstract.** Beluga is a proof checker that provides sophisticated infrastructure for implementing formal systems with the logical framework LF and proving metatheoretic properties as total, recursive functions transforming LF derivations. In this paper, we describe Harpoon, an interactive proof engine built on top of Beluga. It allows users to develop proofs interactively using a small, fixed set of high-level *actions* that safely transform a subgoal. A sequence of actions elaborates into a (partial) *proof script* that serves as an intermediate representation describing an assertion-level proof. Last, a proof script translates into a Beluga program which can be type-checked independently. Harpoon is available on GitHub. We have used Harpoon to replay a wide array of examples covering all features supported by Beluga. In particular, we have used it for normalization proofs, including the recently proposed POPLMark reloaded challenge.

## **1 Introduction**

Mechanizing formal systems and proofs about them plays an important role in establishing trust in programming languages and verifying software systems in general. Key questions in this setting are how to represent variables, (simultaneous) substitutions, assumptions, and derivations that depend on assumptions. Higher-order abstract syntax (HOAS) provides an elegant and unifying answer to these questions, relieving users from having to write boilerplate code.

Beluga is a proof checker with built-in support for HOAS encodings of formal systems based on the logical framework LF [13]. Metatheoretic inductive proofs are implemented as recursive, dependently-typed functions that manipulate and transform HOAS representations [21,4,25]. In this paper, we describe the interactive proof engine Harpoon which is built on top of Beluga. A Harpoon user modularly and incrementally develops a metatheoretic proof by solving independent subgoals via a fixed set of high-level *actions*. An action eliminates the subgoal on which it is executed, filling it with a proof that possibly contains new subgoals to be resolved. The actions we support are: introduction of assumptions, case-analysis, inductive reasoning, and both forward and backward reasoning styles.

© The Author(s) 2021 A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 636–648, 2021. https://doi.org/10.1007/978-3-030-79876-5_38

While our fixed set of actions is largely inspired by similar systems such as Twelf [20,28,27] and Abella [11], Harpoon advances the state of the art in interactively developing mechanized proofs about HOAS representations in two ways:

1. We treat subgoals as first-class and characterize them using contextual types that pair their goal types together with the contexts in which they are meaningful; a contextual substitution property guarantees that each step of proof development correctly refines the partial proof under construction [8].
2. Rather than simply record the sequence of actions given by the user, we elaborate this sequence into an assertion-level proof [15], represented as what we call a *proof script*. The proof script is what we record as the output of an interactive session. It can be both typechecked directly and translated into a Beluga program.

We have used Harpoon (see https://beluga-lang.readthedocs.io/) on a wide range of representative examples from the Beluga library: normalization proofs for the simply-typed lambda calculus [6], benchmarks for reasoning about binders [9,10], and the recent POPLMark Reloaded challenge [1]. These examples involve numerous concerns that arise in proof development, and cover all the domain-specific abstractions that Beluga provides. Our experience shows that Harpoon lowers the entry barrier for users: they only need to understand how to represent formal systems and derivations using HOAS encodings, and can then manipulate the HOAS representations directly via high-level actions that correspond closely to how proofs are developed on paper. As such, we believe that Harpoon eases the task of proving metatheoretic statements.

## **2 Proof Development in Harpoon**

We introduce the main features of Harpoon by interactively developing the proof of two lemmas that play a central role in the proof of weak normalization of the simply-typed lambda calculus. For a more detailed description, see [6].

### **2.1 Initial setup: encoding the language**

We begin by defining the simply-typed lambda-calculus in the logical framework LF [13] using an intrinsically typed encoding. In typical HOAS style, lambda abstraction takes an LF function representing the abstraction of a term over a variable. There is no case for variables, as they are treated implicitly. We remind the reader that this is a weak, representational function space – there is no case analysis or recursion, so only genuine lambda terms can be represented.

```
LF tp : type =
  | unit : tp
  | arr  : tp → tp → tp;

LF tm : tp → type =
  | lam : (tm T1 → tm T2) → tm (arr T1 T2)
  | app : tm (arr T1 T2) → tm T1 → tm T2;
```
Free variables such as T1 and T2 are implicitly universally quantified (see [23]) and programmers subsequently do not supply arguments for implicitly quantified parameters when using a constructor.

Next, we define a small-step operational semantics for the language. For simplicity, we use a call-by-name reduction strategy and do not reduce under lambda-abstractions. Note that we use LF application to encode the object-level substitution in the **s\_beta** rule.

```
LF step : tm T → tm T → type =
  | s_beta : step (app (lam M) N) (M N)
  | s_app  : step M M' → step (app M N) (app M' N);

LF steps : tm T → tm T → type =
  | next : step M M' → steps M' N → steps M N
  | refl : steps M M;
```
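The call-by-name strategy above can also be simulated in a general-purpose language. The following Python sketch is our own illustration (not part of Beluga; all names are invented): it represents binders as host-language functions in HOAS style, so that, as in the **s\_beta** rule, object-level substitution is just application.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union

@dataclass
class Lam:
    body: Callable  # the binder is a Python function: substitution is application

@dataclass
class App:
    fn: "Term"
    arg: "Term"

Term = Union[Lam, App]

def step_once(m: Term) -> Optional[Term]:
    """One step of call-by-name reduction; None if no rule applies."""
    if isinstance(m, App):
        if isinstance(m.fn, Lam):
            return m.fn.body(m.arg)    # s_beta: substitute by applying the body
        m2 = step_once(m.fn)
        if m2 is not None:
            return App(m2, m.arg)      # s_app: reduce the function position
    return None                        # values: no reduction under lambda

def evaluate(m: Term) -> Term:
    """Iterate step_once to a value, mirroring the steps/halts definitions."""
    while (m2 := step_once(m)) is not None:
        m = m2
    return m
```

Unlike the intrinsically typed LF encoding, this sketch is untyped; like the object language above, it handles only closed terms and never reduces under a lambda.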
Using this definition, we define a notion of termination: a term halts if it reduces to a value. This is captured by the constructor **halts/m**.

```
LF val : tm T → type = v_lam: val (lam M);
LF halts : tm T → type = halts/m : val V → steps M V → halts M;
```
### **2.2 Termination Property: intros, split, unbox, and solve**

As the first short lemma, we show the Termination property: if M' is known to halt and **steps** M M', then M also halts. We start our interactive proof session by loading the signature and defining the name of the theorem and the statement that we want to prove.

```
Name of theorem: halts_step
Statement of theorem: [⊢ step M M'] → [⊢ halts M'] → [⊢ halts M]
```
We pair each LF object such as **step** M M' together with the LF context in which it is meaningful [21,26,19]. We refer to such an object as a *contextual object* and embed contextual types, written as [Ψ ⊢ A], into Beluga types using the "box" syntax. In this example, the LF context, written on the left of ⊢, is empty, as we consider closed LF objects. As before, the free variables M and M' are implicitly quantified on the outside. They themselves stand for contextual objects and have contextual type (⊢ **tm** T). The *theorem statements* are hence *statements about contextual LF objects* and directly correspond to Beluga types.

The proof begins with a single subgoal whose type is simply the statement of the theorem under no assumptions. Since this subgoal has a function type, Harpoon automatically applies the **intros** action, which introduces assumptions as follows. First, the (implicitly) universally quantified variables M and M' are added to the *meta-context*. This context collects parameters introduced by universal quantifiers, in contrast with the *computational context*, which collects assumptions introduced by the simple function space. In particular, the second phase of the **intros** action adds the assumptions s:[⊢ **step** M M'] and h:[⊢ **halts** M'] to the computational context. Observe that since M and M' have type **tm** T, **intros** also adds T to the meta-context, although it is implicit in the definitions of **step** and **halts** and is not visible at all in the theorem statement (see the meta-context in Fig. 1, step 1).
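To give a rough sense of the computational phase of **intros**, the following toy Python helper (purely illustrative; not how Harpoon is implemented) peels the simple function space off a goal written as a string:

```python
def intros(goal: str):
    """Split 'A → B → C' into hypotheses [A, B] and the residual goal C.

    Toy assumption: the arrow '→' occurs only at the top level of the goal.
    """
    parts = [p.strip() for p in goal.split("→")]
    return parts[:-1], parts[-1]
```

For the theorem above, this would yield the two hypotheses later named s and h, leaving the residual goal `[⊢ halts M]`.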

The proof proceeds by inversion on h. Using the **split** action, we add the two new assumptions S:(⊢ **steps** M' M2) and V:(⊢ **val** M2) to the meta-context


**Fig. 1.** Interactive session of the proof for the **halts\_step** lemma.

(see Fig. 1, step 1). To build a proof for [⊢ **halts** M], we need to show that M reduces to some value M2. To build such a derivation, we first use the **unbox** action on the computation-level assumption s to obtain an assumption S' in the meta-context, which is accessible to the LF layer (inside a box) (see Fig. 1, step 2). Finally, we finish the proof by supplying the term [⊢ **halts/m** (**next** S' S) V] with the **solve** action (see Fig. 1, step 3). This is similar to the exact tactic in Coq.

The resulting proof script is given below. Assertions are written in boldface, and curly braces denote new scopes, listing the full meta-context and the full computational context. Using an erasure, we can then generate a translated program in the external syntax, i.e., the syntax a user would use when implementing the proof directly, rather than the internal syntax. It is hence much more compact than the actual proof script. This program can be seamlessly combined with hand-written Beluga programs and can also be independently type-checked.

```
Theorem halts_step: [⊢ step M M'] → [⊢ halts M'] → [⊢ halts M]

Proof Script:
  intros
  { T : (⊢ tp), M : (⊢ tm T), M' : (⊢ tm T)
  | s : [⊢ step M M'], h : [⊢ halts M']
  ; split h as
    case halts/m:
    { T : (⊢ tp), M : (⊢ tm T), M' : (⊢ tm T),
      M2 : (⊢ tm T), S : (⊢ steps M' M2), V : (⊢ val M2)
    | s : [⊢ step M M'], h : [⊢ halts M']
    ; by s as S' unboxed
    ; solve [⊢ halts/m (next S' S) V]
    }
  }

Erased program (external syntax):
  fn s => fn h =>
    let [⊢ halts/m S V] = h in
    let [⊢ S'] = s in
    [⊢ halts/m (next S' S) V]
```
### **2.3 Setup continued: reducibility**

We now consider one of the key lemmas in the weak normalization proof, the backwards closed lemma: if M' is reducible at some type T and M steps to M', then M is also reducible at T. We begin by defining the set of terms *reducible* at a type T. All reducible terms are required to halt, and reducible terms at an arrow type are required to produce reducible output given reducible input. Concretely, a term M is reducible at type (**arr** T1 T2) if, for all terms N:**tm** T1 where N is reducible at type T1, (**app** M N) is reducible at type T2. Reducibility cannot be directly encoded on the LF layer, as it does not merely describe the syntax of an expression or derivation. Instead, we encode the set of reducible terms using the stratified type **Reduce**, which is recursively defined on the type T in Beluga (see [16]). Note that we write { } for explicit universal quantification over contextual objects.

```
stratified Reduce : {T : (⊢ tp)} [⊢ tm T] → ctype =
 | Unit: [⊢ halts M] → Reduce [⊢ unit] [⊢ M]
 | Arr : [⊢ halts M]
      → ({N : (⊢ tm T1)} Reduce [⊢ T1] [⊢ N] → Reduce [⊢ T2] [⊢ app M N])
      → Reduce [⊢ arr T1 T2] [⊢ M];
```
### **2.4 Backwards Closed Property: msplit, suffices, and by**

We can now state the backwards closed lemma formally as follows: if M' is reducible at some type T and M steps to M', then M is also reducible at T. We prove this lemma by induction on T. This is specified by referring to the position of the induction variable in the statement.

```
Name of theorem: bwd_closed
Statement of theorem:
  {T : (⊢ tp)} {M : (⊢ tm T)} {M' : (⊢ tm T)}
  [⊢ step M M'] → Reduce [⊢ T] [⊢ M'] → Reduce [⊢ T] [⊢ M]
Induction order: 1
```
After Harpoon automatically introduces the metavariables T, M, and M' together with the assumptions s:[⊢ **step** M M'] and r:**Reduce** [⊢ T] [⊢ M'], we use **msplit** T to split the proof into two cases (see Fig. 2, step 1). Whereas **split** case-analyzes a Beluga type, **msplit** considers the cases for a (contextual) LF type. Internally, **msplit** is implemented in terms of the **split** action.

The case for T = **unit** is straightforward (see Fig. 2, steps 2 and 3). First, we use the **split** action to invert the premise r : **Reduce** [⊢ **unit**] [⊢ M']. Then, we use the **by** action to invoke the **halts\_step** lemma (see Sec. 2.2) to obtain an assumption h:[⊢ **halts** M]. We **solve** this case by supplying the term **Unit** h (see Fig. 2, step 3).

In the case for T = **arr** T1 T2, we begin similarly by inversion on r using the **split** action (see Fig. 3, step 4). We observe that the goal type is **Reduce** [⊢ **arr** T1 T2] [⊢ M], which can be produced using the **Arr** constructor if we can construct a proof for each of the user-specified types, [⊢ **halts** M] and {N:(⊢ **tm** T1)} **Reduce** [⊢ T1] [⊢ N] → **Reduce** [⊢ T2] [⊢ **app** M N]. Such *backwards reasoning* is accomplished via the **suffices** action. The user supplies a term representing an implication whose conclusion is compatible with the current goal and proceeds to prove its premises as specified (see Fig. 3, step 5).


**Fig. 2.** Backwards Closed Lemma. Step 1: Case analysis of the type T; Steps 2 and 3: Base case (T = **unit**).

To prove the first premise, we apply the **halts\_step** lemma (see Fig. 3, step 6). As for the second premise, Harpoon first automatically introduces the variable N:(⊢ **tm** T1) and the assumption r1:**Reduce** [⊢ T1] [⊢ N], so it remains to show **Reduce** [⊢ T2] [⊢ **app** M N]. We deduce r':**Reduce** [⊢ T2] [⊢ **app** M' N] using the assumption rn. Using s:[⊢ **step** M M'], we build a derivation s':[⊢ **step** (**app** M N) (**app** M' N)] using **s\_app**. Finally, we appeal to the induction hypothesis: using the **by** action, we refer to the recursive call to complete the proof (see Fig. 3, step 7). The resulting proof script (of around 70 lines) can again be translated into a compact program.

Note that Harpoon allows users to use underscores to stand for arguments that are uniquely determined (see Fig. 3, step 7). We enforce that these underscores stand for uniquely determined objects in order to guarantee that the contexts and the goal type of every subgoal are closed. This ensures modularity: solving one subgoal does not affect any other open subgoal. As a consequence, users are not restricted in their proof development. As they would on paper, users can work on goals in any order, mix forward and backward reasoning, erase wrong steps, and replace them with correct ones.

Using the actions explained above, one can now prove the fundamental lemma and the weak normalization theorem. For a more detailed description of this proof in Beluga, see [5,6].

**Additional actions.** Harpoon supports some additional features not discussed in this paper; see https://beluga-lang.readthedocs.io/ for a complete list of actions. In general, these actions add no expressive power, but enable more precise expression of a user's intent. For example, the **invert** action splits on the type of a given term, ensuring that there is a unique case to consider. It is implemented simply as the **split** action followed by an additional check.

## **3 Implementation of Harpoon**

Harpoon is a front end that allows users to construct a proof for a theorem statement represented as a Beluga type. Types in Beluga include universal

```
Step 4:
 Meta-context:
  T1 : (⊢ tp), T2 : (⊢ tp)
  M : (⊢ tm (arr T1 T2)), M' : (⊢ tm (arr T1 T2))
 Computational context:
  s : [⊢ step M M']
  r : Reduce [⊢ arr T1 T2] [⊢ M']
 Goal: Reduce [⊢ arr T1 T2] [⊢ M]
 > split r

Step 5:
 Meta-context: as in step 4
 Computational context:
  s : [⊢ step M M']
  rn : {N : (⊢ tm T1)} Reduce [⊢ T1] [⊢ N] → Reduce [⊢ T2] [⊢ app M' N]
  h' : [⊢ halts M']
  r : Reduce [⊢ arr T1 T2] [⊢ M']
 Goal: Reduce [⊢ arr T1 T2] [⊢ M]
 > suffices by Arr toshow
     [⊢ halts M],
     {N : (⊢ tm T1)} Reduce [⊢ T1] [⊢ N] → Reduce [⊢ T2] [⊢ app M N]

Step 6:
 Meta-context and computational context: as in step 5
 Goal: [⊢ halts M]
 > by halts_step s h' as h

Step 7:
 Meta-context: as in step 5, plus N : (⊢ tm T1)
 Computational context: as in step 5, plus r1 : Reduce [⊢ T1] [⊢ N]
 Goal: Reduce [⊢ T2] [⊢ app M N]
 > by (rn [⊢ N] r1) as r';
   unbox s as S;
   by (bwd_closed _ _ _ [⊢ s_app S] r') as ih
```
**Fig. 3.** Backwards Closed Lemma: Step Case

quantification over contextual types (dependent function space, written with curly braces), implications (simple function space), boxed contextual types, and stratified/recursive types (written as **c** $\vec{C}$, where $\vec{C}$ stands for a sequence of contextual objects). In addition, Beluga supports quantification over LF contexts and even over LF substitutions relating two LF contexts. We omit these below for simplicity, although they are also supported in Harpoon. In essence, Beluga types correspond to statements in first-order logic over a domain consisting of contextual objects, LF contexts, and LF substitutions. We can view **c** $\vec{C}$ and $[\Psi \vdash A]$ as atomic propositions.

$$\begin{array}{ll}\text{Types} & \tau ::= \mathbf{c}\,\vec{C} \mid [\Psi \vdash A] \mid \{X : (\Psi \vdash A)\}\,\tau \mid \tau_1 \to \tau_2 \\ \text{Meta-context} & \Delta ::= \cdot \mid \Delta, X : (\Psi \vdash A) \\ \text{Context} & \Gamma ::= \cdot \mid \Gamma, x : \tau \end{array}$$

Users construct a natural deduction proof for a theorem statement, where *Γ*, the *computational context*, contains hypotheses introduced from the simple function space, and where *Δ*, the *meta-context*, holds parameters introduced by the universal quantifier (curly-brace syntax) or by lifting an assumption [*Ψ* ⊢ *A*] from *Γ* (box-elimination rule).

A subgoal in Harpoon is a typed hole in the proof that remains to be filled by the user. Such a hole is represented by a *subgoal variable*, the type of which is a contextual type (*Δ*; *Γ* ⊢ *τ*) that captures the typechecking state at the point the variable occurs [19,3]: it remains to construct a proof for *τ* with the parameters from *Δ* and the assumptions from *Γ*. Subgoal variables in the proof script are collected into a *subgoal context*, and substitution of subgoal variables is type-preserving [8]. Interactive actions are implemented with subgoal substitutions, so the correctness of interactive proof refinement is a consequence of the subgoal substitution property. Note that a subgoal's type cannot itself contain subgoals: the subgoal type must be fully determined, so solving one subgoal cannot affect any other subgoal. Furthermore, subgoal variables may be introduced only in positions where we must construct a normal term (written *e*); these are terms that we must *check* against a given type. This given type becomes part of the subgoal's type. Subgoal variables thus stand in contrast with ordinary variables, which are neutral terms (written *i*). (See [14,26,16] for examples of this so-called *bi-directional* characterization of normal and neutral proof terms in Beluga.)

An action is executed on a subgoal to eliminate it, while possibly introducing new subgoals. Actions emphasize the bi-directional nature of interactive proof construction: some demand normal terms *e* and others demand neutral terms *i*. To execute an action, the system synthesizes a proof script fragment from it, and substitutes that fragment for the current subgoal. Any subgoal variables present in the fragment become part of the subgoal context, and the user will have to solve them later. When no subgoals remain, the proof script is closed and can be translated straightforwardly to a Beluga program in internal (fully elaborated) syntax. We employ an erasure to display the program to the user. These are the essential actions for proof development, omitting our so-called "administrative" actions (such as **undo**):

Actions *α* ::= **intros** | **solve** *e* | **by** *i* **as** *x* | **unbox** *i* **as** *X* | **split** *i* | **suffices** *i* **by** $\vec{\tau}$

**intros** introduces all assumptions from function types in the current goal; **solve** closes the current subgoal with a given normal term, introducing no new subgoals. This action trivially makes Harpoon complete, as a full Beluga program could be given via **solve** to eliminate the initial subgoal of any proof. The action **by** introduces an intermediate result, often from a lemma or an induction hypothesis, demanding a neutral term *i* and binding it to a given name; **unbox** is the same as **by**, but it binds the result as a variable in the *meta-context*; **split** considers a covering set of cases for a neutral term (typically a variable) and generates possible induction hypotheses based on the specified induction order (for details on coverage, see [24]); **suffices** allows programmers to reason backwards by supplying a neutral term *i* of function type and the types $\vec{\tau}$ of arguments to construct for this function.
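The refinement discipline described above, where each action substitutes a fragment for exactly one subgoal and may introduce fresh subgoals until none remain, can be modeled by a small toy data structure. The sketch below is our own Python illustration (not Beluga's implementation; all names are invented):

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Subgoal:
    name: str
    goal: str            # the contextual type to prove, kept abstract here

@dataclass
class Node:
    rule: str            # the action that produced this fragment
    children: List["Proof"] = field(default_factory=list)

Proof = Union[Subgoal, Node]

def substitute(p: Proof, name: str, fragment: Proof) -> Proof:
    """Replace subgoal `name` by `fragment`. Other subgoals are untouched,
    so solving one subgoal cannot affect any other open subgoal."""
    if isinstance(p, Subgoal):
        return fragment if p.name == name else p
    return Node(p.rule, [substitute(c, name, fragment) for c in p.children])

def open_subgoals(p: Proof) -> List[str]:
    """Collect the subgoal context: the names of all remaining holes."""
    if isinstance(p, Subgoal):
        return [p.name]
    return [n for c in p.children for n in open_subgoals(c)]
```

A proof is closed exactly when `open_subgoals` returns the empty list, at which point it can be translated to a program.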

## **4 Empirical evaluation of Harpoon**

We give a summary of representative case studies that we replayed using Harpoon in Table 1. In porting these proofs to Harpoon, we use **solve** *e* only when *e* is atomic, i.e., it describes either a contextual LF term or a constant applied to all its arguments (either *e* = *M*, *e* = [*C*], or *e* = **c** $\vec{C}$ *e*<sub>1</sub> … *e*<sub>n</sub>). We list in the table the number of commands used to complete each proof and what makes the selected case study interesting for testing Harpoon. The first


**Table 1.** Summary of proofs ported to Harpoon from Beluga.

four examples proceed by straightforward induction, but the remaining examples are less direct since they feature logical relations. The STLC strong normalization and algorithmic equality completeness examples are larger developments, totalling 38 and 26 theorems respectively. Crucially, these case studies make use of Beluga's domain-specific abstractions, by splitting on contexts, reasoning about object-language variables, and exploiting the built-in equational theory of substitutions. We have since used Harpoon to replay the meta-theoretic proofs about Standard ML from [18].

This evaluation gives us confidence in the robustness and expressive power of Harpoon.

## **5 Related work**

There are several approaches to specify and reason about formal systems.

Beluga, and hence Harpoon, belongs to the lineage of the Twelf system [20], which also implements the logical framework LF. Metatheoretic proofs in Twelf are implemented as *relations*. Totality checking then ensures that these relations correspond to actual proofs. As Twelf is limited to proving *Π*<sub>1</sub> formulas ("forall-exists" statements), normalization proofs using logical relations cannot be directly encoded. Although Harpoon's actions are largely inspired by the internal actions of Twelf's (experimental) fully automated metatheorem prover [28,27], Harpoon supports user interaction, more expressive theorem statements, and the generation of proof witnesses, in the form of both the generated proof script and the Beluga program resulting from translation.

The Abella system [11] also provides an interactive theorem prover for reasoning about specifications using HOAS. First, its theoretical basis is quite different from Beluga's: Abella's reasoning logic extends first-order logic with a ∇ quantifier [12] that is used to express properties about variables. Second, Abella's interactive mode provides a fixed set of *tactics*, similar to the actions we describe in this paper. However, these tactics only loosely connect to the actual theoretical foundation of Abella and no proof terms are generated as witnesses by the Abella system.

We can also reason about formal systems in general-purpose proof assistants such as Coq. The general philosophy in such systems is that users should be in the position of writing complex domain-specific tactics to facilitate proof construction, using languages such as Ltac [7] or Mtac(2) [29,17]. Although this is an extremely flexible approach, we believe that the tactic-centric view often obscures the actual line of reasoning in a proof: the proofs themselves can be illegible and incomprehensible. Further, strong static guarantees about interactive proof construction are lacking; for example, variable dependencies are enforced only by *dynamic* checks. In contrast, our goal is to enable mechanized proof development in a style close to that of a proof on paper. Thus we provide a fixed set of tactics suitable for a wide array of proofs, so users can concentrate on proof development instead of tactic development. As such, our work draws inspiration from [2], where the authors describe high-level actions within the tutorial proof checker Tutch. Our work extends and adapts this view to the mechanization of inductive metatheoretic proofs based on HOAS representations.

## **6 Conclusion**

We have presented Harpoon, an interactive, command-driven front end of Beluga for mechanizing metatheoretic proofs based on high-level actions. The sequence of interactive actions is elaborated behind the scenes into a proof script that represents an assertion-level proof. Finally, proof scripts can be soundly translated to Beluga programs. We have evaluated Harpoon on several case studies, ranging from purely syntactic arguments to proofs by logical relations. Our experience is that Harpoon lowers the entry barrier for users to develop metatheoretic proofs about HOAS encodings.

In the future, we aim to extend Harpoon with additional high-level actions that support further automation. A natural first step is an action **trivial** that would attempt to automatically close an open subgoal.

**Acknowledgments.** Jacob Errington and Junyoung Jang acknowledge support from the Fonds de Recherche du Québec – Nature et technologies (FRQNT). Brigitte Pientka acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC).

# **References**


# **Author Index**

Aaronson, Scott 468 Alrabbaa, Christian 291

Baader, Franz 291, 309 Barnett, Lee A. 252 Barrett, Clark 148 Bártek, Filip 525 Bartocci, Ezio 565 Baumgartner, Peter 589 Bentkamp, Alexander 378, 396, 415 Bibel, Wolfgang 58 Biere, Armin 252 Blanchette, Jasmin 344, 396, 415 Borgwardt, Stefan 291 Brauße, Franz 113 Bryant, Randal E. 433

Chaudhuri, Kaustuv 200 Ciabattoni, Agata 565 Cimatti, Alessandro 131 Cohen, Liron 3 Cruanes, Simon 415

De Lon, Adrian 614 Desharnais, Martin 450 Dixon, Clare 76 Draheim, Dirk 507

Ebner, Gabriel 344 Echenim, Mnacho 183 Errington, Jacob 636

Fiorentini, Camillo 217 Fleury, Mathias 450

Golińska-Pilarek, Joanna 41 Governatori, Guido 565 Griggio, Alberto 131

Haifani, Fajar 327 Han, Jesse Michael 577 Heule, Marijn J. H. 433, 468 Hozzová, Petra 361

Hustadt, Ullrich 76 Huuskonen, Taneli 41

Iosif, Radu 183

Jang, Junyoung 636 Järv, Priit 507

Kim, Dohan 166 Koepke, Peter 614 Koopmann, Patrick 291, 309 Korovin, Konstantin 113 Korovina, Margarita V. 113 Kovács, Laura 361 Kovtunova, Alisa 291 Kriegel, Francesco 309 Krueger, Ryan 577

Li, Liming 485 Lorenzen, Anton 614 Lynch, Christopher 166

Marti, Adrian 614 Moura, Leonardo de 625 Müller, Norbert Th. 113

Nalon, Cláudia 76 Neufeld, Emery 565 Nigam, Vivek 234 Nipkow, Tobias 93 Nummelin, Visa 378, 415 Nuradiansyah, Adrian 309

Papacchini, Fabio 76 Peltier, Nicolas 183 Pientka, Brigitte 636

Rabe, Markus N. 25 Rahmouni, Samar 234 Redondi, Gianluca 131 Reis, Giselle 234 Reynolds, Andrew 148 Ringeissen, Christophe 148 Roßkopf, Simon 93 Ruess, Harald 234

Schurr, Hans-Jörg 450 Schütz, Marcel 614 Selsam, Daniel 577 Sheng, Ying 148 Smallbone, Nicholas 602 Suda, Martin 525, 543 Szegedy, Christian 25

Tammet, Tanel 507 Tinelli, Cesare 148 Tourret, Sophie 327, 344, 378, 396, 415

Ullrich, Sebastian 625

Voronkov, Andrei 361 Vukmirović, Petar 378, 396, 415

Weidenbach, Christoph 327 Wenzel, Makarius 614 Wernhard, Christoph 58

Xu, Runqing 485

Yamada, Akihisa 273 Yolcu, Emre 468

Zawidzki, Michał 41 Zhan, Bohua 485 Zohar, Yoni 148